Nesterov's accelerated gradient methods (AGM) have been successfully applied in many machine learning areas. However, their empirical performance on training max-margin models has been inferior to existing specialized solvers. In this paper, we first extend AGM to strongly convex and composite objective functions with Bregman style prox-functions. Our unifying framework covers both the ∞-memory and 1-memory styles of AGM, tunes the Lipschitz constant adaptively, and bounds the duality gap. Then we demonstrate various ways to apply this framework of methods to a wide range of machine learning problems. Emphasis will be placed on their rate of convergence and how to efficiently compute the gradient and optimize the models. The experimental results show that with our extensions AGM outperforms state-of-the-art solvers on max-margin models.



1. Introduction

There has been an explosion of interest in machine learning over the past decade, much of which has been fueled by the phenomenal success of binary Support Vector Machines (SVMs). Driven by numerous applications, there has recently been increasing interest in support vector learning with linear models. At the heart of SVMs is the following regularized risk minimization (RRM) problem:

$$\min_{\mathbf{w}} J(\mathbf{w}) := \underbrace{\lambda\,\Omega(\mathbf{w})}_{\text{regularizer}} + \underbrace{R_{\mathrm{emp}}(\mathbf{w})}_{\text{empirical risk}} \tag{1}$$

$$\text{with}\quad \Omega(\mathbf{w}) := \frac{1}{2}\|\mathbf{w}\|_2^2, \tag{2}$$

$$R_{\mathrm{emp}}(\mathbf{w}) := \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b)\right]_+, \tag{3}$$

where $[x]_+ = x$ if $x > 0$ and $0$ otherwise. Here we assume access to a training set of $n$ labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ where $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$, and use half the squared Euclidean norm, with $\|\mathbf{w}\|_2^2 = \sum_i w_i^2$, as the regularizer. The parameter $\lambda$ controls the trade-off between the empirical risk and the regularizer.

There has been significant research devoted to developing specialized optimizers which minimize $J(\mathbf{w})$ efficiently. Zhang et al. [1] proved that cutting plane and bundle methods may require at least $O(np/\epsilon)$ computational effort to find an $\epsilon$ accurate solution to (1), and they suggested using Nesterov's accelerated gradient method (AGM), which provably costs $O(np/\sqrt{\epsilon})$ time. In general, AGM takes $O(1/\sqrt{\epsilon})$ gradient queries to find an $\epsilon$ accurate solution to

$$\min_{\mathbf{x} \in Q} f(\mathbf{x}), \tag{4}$$

where $f$ is convex and has $L$-Lipschitz continuous gradient ($L$-l.c.g), and $Q$ is a closed convex set in the Euclidean space. AGM is especially suitable for large scale optimization problems because in each iteration it only requires the gradient of $f$.
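To make the RRM objective (1)-(3) concrete, the following is a minimal NumPy sketch of how $J(\mathbf{w})$ could be evaluated for the binary SVM. The function name `rrm_objective` and the argument layout (rows of `X` are the examples $\mathbf{x}_i$) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def rrm_objective(w, b, X, y, lam):
    """Evaluate J(w) = lam * 0.5 * ||w||^2 + mean hinge loss.

    X has shape (n, p) with one example per row; y holds labels in {-1, +1}.
    This is a hypothetical helper written for illustration only.
    """
    margins = y * (X @ w + b)                # y_i * (<w, x_i> + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # [1 - margin]_+
    return lam * 0.5 * np.dot(w, w) + hinge.mean()

# Tiny example: two separable points.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
y_demo = np.array([1.0, -1.0])
J_sep = rrm_objective(np.array([2.0, -2.0]), 0.0, X_demo, y_demo, lam=0.1)
J_zero = rrm_objective(np.zeros(2), 0.0, X_demo, y_demo, lam=0.1)
```

With $\mathbf{w} = (2, -2)$ both margins exceed 1, so only the regularizer contributes; with $\mathbf{w} = \mathbf{0}$ every example incurs unit hinge loss.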

Unfortunately, despite some successful applications of AGM in learning sparse models [2, 3] and game playing [4], it does not compare favorably to existing specialized optimizers when applied to training large margin models [5]. It turns out that special structures exist in those problems, and to make full use of AGM, one must utilize the computational and statistical properties of the learning problem by properly reformulating the objectives and tailoring the optimizers accordingly.

To this end, our first contribution is to show that in both theory and practice smoothing $R_{\mathrm{emp}}(\mathbf{w})$ as in [6] is advantageous over the primal-dual versions of AGM. The dual of (1) is

$$\max_{\boldsymbol{\alpha}} D(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2\lambda}\boldsymbol{\alpha}^\top Y X^\top X Y \boldsymbol{\alpha}, \tag{5}$$

$$\text{s.t.}\quad \boldsymbol{\alpha} \in Q_2 := \Big\{\boldsymbol{\alpha} \in [0, n^{-1}]^n : \sum_i y_i \alpha_i = 0\Big\}. \tag{6}$$

Comparing (4) with (1) and (5), it seems more natural to apply AGM to (5) because it is smooth. However, in practice most $\alpha_i$ at the optimum will be on the boundary of $[0, n^{-1}]$. According to [7], such $\alpha_i$'s are easy to identify, and so the corresponding entries in the gradient are wasted by AGM. This support vector structure is unique to max-margin models, and it will also be manifested in our experiments (Section 6).

In contrast, smoothing $R_{\mathrm{emp}}$ has a number of advantages. First, it directly optimizes the primal $J$, avoiding the indirect translation from the dual solution to the primal. Second, the resulting optimization problem is unconstrained. If $\Omega$ is strongly convex, then linear convergence can be achieved. Third, the gradient of the smoothed $R_{\mathrm{emp}}$ can often be computed efficiently, and details will be given in Section 5.4. Fourth, the diameter of the dual space $Q_2$ often grows slowly with $n$, or even decreases. This allows using a loose smoothing parameter. Fifth, in practice most $\alpha_i$ at the optimum are 0, where the smoothed $R_{\mathrm{emp}}$ best approximates $R_{\mathrm{emp}}$. Therefore, the approximation is actually much tighter than the worst case theoretical bound, and a good solution for the smoothed $R_{\mathrm{emp}}$ is more likely to optimize $R_{\mathrm{emp}}$ too. Last but most important, the smoothed $R_{\mathrm{emp}}$ are themselves reasonable risk measures [8], which also deliver good generalization performance in statistics. Now that it is much easier to optimize the smoothed objectives, a model which generalizes well can be quickly obtained with the homotopy scheme (i.e. annealing the smoothing parameter).

Using the same idea of smoothing $R_{\mathrm{emp}}$, AGM can be applied to a much wider variety of RRM problems by utilizing its composite structure. Given a model $\psi$ of $R$, if $\Omega(\mathbf{w}) + \psi(\mathbf{w})$ can be solved efficiently, then [9] showed that $\Omega(\mathbf{w}) + R(\mathbf{w})$ can be solved in $O(1/\sqrt{\epsilon})$ steps, even if $\Omega$ is not differentiable, e.g. the $L_1$ norm [10].



A similar approach is applied to the $L_{1,\infty}$ regularizer and the elastic net [11] regularizer by [12]:

$$\Omega(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 + \|\mathbf{w}\|_1 = \frac{1}{2}\sum_i w_i^2 + \sum_i |w_i|. \tag{7}$$

This $\Omega$ is strongly convex with respect to (wrt) the $L_1$ norm, and similarly in many RRM problems $\Omega$ is strongly convex wrt some norm $\|\cdot\|$. For example, the relative entropy regularizer in boosting [13]:

$$\Omega(\mathbf{w}) = \sum_i w_i \log w_i \tag{8}$$

is strongly convex wrt the $L_1$ norm, and the log determinant of a matrix in [14-16]:

$$\Omega(W) = -\log\det W \tag{9}$$

is strongly convex wrt the Frobenius norm. By exploiting the strong convexity, [17] accelerated the convergence rate from $O(1/\sqrt{\epsilon})$ to $O(\log\frac{1}{\epsilon})$. However, the prox-function in this case must be strongly convex wrt $\|\cdot\|$ too. Existing methods either ignore the strong convexity in $\Omega$ [9], or restrict the norm to $L_2$ [10, 17]. As one major contribution of this paper, we extend AGM to exploit this strong convexity in the context of Bregman divergence. In particular, we allow $\Omega$ to be strongly convex wrt a Bregman divergence induced by a smooth convex function $d$ (to be formalized later), where $d$ is in turn strongly convex wrt a certain norm $\|\cdot\|$. By using $d$ as a prox-function, we manage to achieve linear convergence for a wide range of RRM problems.

There are two types of first order methods that both achieve the optimal rate. The first type is the original AGM pioneered by Nesterov [6, 17-20], which uses a sequence of estimation functions (hence we call it AGM-EF). In particular, it uses all the past iterates to progressively build a sequence of estimate functions which approximate the objective function. The second type was developed by a number of other researchers, and a unified treatment was given by [9]. Intuitively, it generalizes the idea of gradient descent by proximal regularization (hence we call it AGM-PR), which can be further accelerated by momentum. Therefore, these two types of methods are different in concept. In addition, both AGM-EF and AGM-PR have an ∞-memory version which builds a model of the objective by using all the past gradients, and a 1-memory version which approximates that model by a single Bregman divergence.

We choose to base our extensions on AGM-EF, because compared with AGM-PR it provides much more flexibility in adaptively tuning $L$.¹ This is because the inductive relationship maintained by AGM-EF involves a single iteration, while that for AGM-PR involves two successive steps. The novelty and generality of our method in the context of existing methods are summarized in Table 1. We further provide bounds on the duality gap, which amount to effective termination criteria. As another important contribution, we derive linear convergence for the duality gap in the context of strong convexity. Computationally, at each iteration our method requires only one projection and one gradient evaluation within the feasible region.²

                          |                |   Composite
                          |  cvx     sc    |   cvx     sc
  Euclidean   1-memory    |  [19]    [19]  |   ×       ×
              ∞-memory    |  [6]     [17]  |   [17]    [17]
  Bregman     1-memory    |  [23]    ×     |   ×       ×
              ∞-memory    |  [6]     ×     |   ×       ×

Table 1. Summary of AGM-EF. "sc" means strongly convex and "cvx" means just convex. × means novel contribution of this paper. AGM-PR can handle all but sc.

Outline of the paper. In Section 2, we follow [24, Section 4.1, Definition 3] to extend the concept of strong convexity to the context of Bregman divergence. We show several properties that will play a key role in the subsequent development of the new algorithms. In Sections 3 and 4, two novel variants of AGM-EF are developed along the lines of ∞-memory and 1-memory. They both achieve global linear convergence by utilizing the Bregman generalized strong convexity in either $\Omega$ or $R_{\mathrm{emp}}$. Section 5 elaborates on how to effectively apply our method to solve Bregman regularized risk minimization problems, and many examples of machine learning models are discussed. Also presented are the algorithms which efficiently compute the gradient and solve the model. Experimental results are given in Section 6, where we show empirically that by smoothing $R_{\mathrm{emp}}$ and exploiting the generalized strong convexity in $\Omega$, the $L_1$ and entropic regularized risk minimization problems can be solved significantly faster than the state-of-the-art optimizers.

A ready reckoner of the convex analysis concepts used in the paper can be found in Appendix A.

1 All AGM-PR variants with adaptive $L$, e.g. [9, 10, 21, 22], require the estimate of $L$ to grow monotonically through iterations. Their technique also does not extend to asymmetric Bregman divergence.

2 Some AGM algorithms require two projections [6] or two gradients [17] per iteration, or evaluate the gradient outside the feasible region [19, Section 2.2.4].



2. Preliminaries 

From the optimization perspective, the objectives considered in this paper have the same form as in [9]. Let $\mathbb{R}^p$ be endowed with a norm $\|\cdot\|$. Consider the following nonsmooth convex objective:

$$\min_{\mathbf{x}} J(\mathbf{x}) = f(\mathbf{x}) + \Psi(\mathbf{x}), \tag{10}$$

where $\Psi : \mathbb{R}^p \to \bar{\mathbb{R}} := (-\infty, +\infty]$ and $f : \mathbb{R}^p \to \bar{\mathbb{R}}$ are proper, lower semicontinuous (lsc) and convex. Assume $\mathrm{dom}\,\Psi$ is closed, $f$ is differentiable on an open set containing $\mathrm{dom}\,\Psi$, and $\nabla f$ is Lipschitz continuous on $\mathrm{dom}\,\Psi$, i.e. there exists $L > 0$ such that

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_* \le L\,\|\mathbf{x} - \mathbf{y}\| \qquad \forall\, \mathbf{x}, \mathbf{y} \in \mathrm{dom}\,\Psi.$$

Some special cases are in order. The first is constrained smooth optimization, where $\Psi$ is the indicator function for a nonempty closed convex set $Q \subseteq \mathbb{R}^p$:

$$\Psi(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{x} \in Q \\ +\infty & \text{otherwise.} \end{cases}$$

Therefore, in the sequel we will always discuss unconstrained minimization for $J$, although this is just a matter of notation. A second example is the $L_1$ regularization, where

$$\Psi(\mathbf{x}) = \sum_i |x_i|.$$

In fact, many machine learning problems are special cases of (10), and details can be found in Section 5 and [25, Table 5].

Next, we will present in detail two additional assumptions: strong convexity of $f$ and $\Psi$ in the sense of Bregman divergence, and efficiently solvable ground optimization problems.

2.1. Extending strong convexity to Bregman 
divergence 

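Let $d$ be a differentiable and $\sigma$ strongly convex ($\sigma$-sc) function with respect to some norm $\|\cdot\|$.³ Then we can define a Bregman divergence:

$$\Delta_d(\mathbf{x}, \mathbf{y}) = d(\mathbf{x}) - d(\mathbf{y}) - \langle \nabla d(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle.$$

By the definition of $\sigma$-sc, we have

$$\Delta_d(\mathbf{x}, \mathbf{y}) \ge \frac{\sigma}{2}\|\mathbf{x} - \mathbf{y}\|^2 \qquad \text{for all } \mathbf{x}, \mathbf{y}.$$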
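As an illustration of the definition above, the sketch below computes $\Delta_d$ for two common prox-functions: the squared Euclidean norm (1-sc wrt the $L_2$ norm) and the negative entropy restricted to the simplex (1-sc wrt the $L_1$ norm, by Pinsker's inequality). The helper name `bregman` and the sample points are assumptions made for this example.

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Delta_d(x, y) = d(x) - d(y) - <grad d(y), x - y>."""
    return d(x) - d(y) - np.dot(grad_d(y), x - y)

# Squared Euclidean prox-function, d(x) = 0.5 * ||x||_2^2.
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# Negative entropy, d(x) = sum_i x_i log x_i (defined for positive x).
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

# Two points on the probability simplex.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.3, 0.1])

div_sq = bregman(sq, sq_grad, x, y)          # equals 0.5 * ||x - y||_2^2
div_kl = bregman(negent, negent_grad, x, y)  # equals KL(x || y) on the simplex
```

On the simplex the linear terms cancel, so the entropic divergence reduces to the Kullback-Leibler divergence, and the bound $\Delta_d(\mathbf{x},\mathbf{y}) \ge \frac{1}{2}\|\mathbf{x}-\mathbf{y}\|_1^2$ can be checked directly.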

Furthermore, Bregman divergence can be used to generalize the concept of strong convexity [24, Definition 3, Chapter 4].

3 AGM capitalizes on two properties of the norm: convexity and linearity ($\|c \cdot \mathbf{x}\| = c\,\|\mathbf{x}\|$).

Definition 1 (Strong convexity for Bregman divergence). A convex function $f$ is said to be $\lambda$ strongly convex with respect to $d$ ($\lambda$-sc wrt $d$) with $\lambda \ge 0$ if for all $\mathbf{x}$ and $\mathbf{y}$ we have

$$f(\mathbf{x}) \ge f(\mathbf{y}) + \langle \mathbf{g}, \mathbf{x} - \mathbf{y} \rangle + \lambda\,\Delta_d(\mathbf{x}, \mathbf{y}) \qquad \forall\, \mathbf{g} \in \partial f(\mathbf{y}).$$

If $\lambda > 0$, we say $f$ is strictly strongly convex.

For example, with $d(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2$ where the norm is Euclidean, we recover the conventional strong convexity. Here we allow $\lambda$ to be 0 for a unified exposition, and trivially all convex functions are 0-sc wrt any $d$. It is noteworthy that Definition 1 preserves some important properties of the conventional strong convexity.

Property 1. If $f$ is $\lambda$-sc wrt $d$, then $f$ must be $\lambda\sigma$-sc wrt $\|\cdot\|$. Hence for any $\alpha \in [0, 1]$ and $\mathbf{x}, \mathbf{y}$, we have

$$f(\alpha\mathbf{x} + (1-\alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}) - \frac{\lambda\sigma}{2}\,\alpha(1-\alpha)\,\|\mathbf{x} - \mathbf{y}\|^2.$$

Property 2. If $\alpha_i \ge 0$ and $f_i$ is $\lambda_i$-sc wrt $d$ ($\lambda_i \ge 0$), then $\sum_i \alpha_i f_i$ is $\sum_i \alpha_i\lambda_i$-sc wrt $d$.

Property 3. $d(\mathbf{x})$ is 1-sc wrt $d$. So by Property 2, $\Delta_d(\mathbf{x}, \mathbf{x}_0)$ is also 1-sc wrt $d$ for any fixed $\mathbf{x}_0$.

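Many problems are constrained to a feasible region $Q$. In the sequel we will always assume that $Q \subseteq \mathrm{dom}\, d$ and $Q$ is closed and convex.

Property 4. Suppose $f : \mathbb{R}^n \to \bar{\mathbb{R}}$ is proper, lsc, and $\lambda$-sc wrt $d$, and $\mathbf{x}^* = \mathrm{argmin}_{\mathbf{x}} f(\mathbf{x})$. Then

$$f(\mathbf{x}) - f(\mathbf{x}^*) \ge \lambda\,\Delta_d(\mathbf{x}, \mathbf{x}^*) \qquad \text{for all } \mathbf{x} \in \mathrm{dom}\, f.$$

The proof simply uses the definition of $\lambda$-sc and the optimality condition of $\mathbf{x}^*$: $\langle \mathbf{g}, \mathbf{x} - \mathbf{x}^* \rangle \ge 0$ for all $\mathbf{g} \in \partial f(\mathbf{x}^*)$ and $\mathbf{x} \in \mathrm{dom}\, f$.

A direct application of Properties 2, 3 and 4 gives a very important inequality which is also used extensively in [9, Property 1] and [26, Lemma 6]:

Property 5. Suppose $f$ is proper, lsc, and convex with range $\bar{\mathbb{R}}$. Let $\mathbf{x}^* = \mathrm{argmin}_{\mathbf{x}}\, f(\mathbf{x}) + \Delta_d(\mathbf{x}, \mathbf{x}_0)$, then for all $\mathbf{x}$

$$f(\mathbf{x}) + \Delta_d(\mathbf{x}, \mathbf{x}_0) \ge f(\mathbf{x}^*) + \Delta_d(\mathbf{x}^*, \mathbf{x}_0) + \Delta_d(\mathbf{x}, \mathbf{x}^*).$$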

The following property of Bregman divergence plays a 
key role in keeping a compact expression of our esti- 
mation functions. 

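Property 6. For all $a_i \ge 0$ and $\mathbf{x}_i$ in the interior of $\mathrm{dom}\, d$, define

$$q(\mathbf{x}) := \langle \mathbf{s}, \mathbf{x} \rangle + \sum_i a_i\,\Delta_d(\mathbf{x}, \mathbf{x}_i).$$

Then $q(\mathbf{x})$ can be equivalently expressed as

$$q(\mathbf{x}) = a\,\Delta_d(\mathbf{x}, \mathbf{x}^*) + b,$$

where $a = \sum_i a_i$, $\mathbf{x}^* = \mathrm{argmin}_{\mathbf{x}}\, q(\mathbf{x})$, and $b = q(\mathbf{x}^*)$. Note $\mathbf{x}^*$ is the unconstrained minimizer of $q(\mathbf{x})$.

Proof. By the optimality condition of $\mathbf{x}^*$ we have

$$\Big\langle \mathbf{s} + \sum_i a_i\big(\nabla d(\mathbf{x}^*) - \nabla d(\mathbf{x}_i)\big),\, \mathbf{x} - \mathbf{x}^* \Big\rangle = 0 \qquad \forall\, \mathbf{x}. \tag{11}$$

This equality must be changed to $\ge$ if $\mathbf{x}^*$ is the minimizer of $q(\mathbf{x})$ over a constrained set $Q \subseteq \mathrm{dom}\, d$. By definition,

$$q(\mathbf{x}^*) = \langle \mathbf{s}, \mathbf{x}^* \rangle + \sum_i a_i\,\Delta_d(\mathbf{x}^*, \mathbf{x}_i).$$

Subtracting it from the definition of $q(\mathbf{x})$ we get

$$\begin{aligned}
q(\mathbf{x}) &= q(\mathbf{x}^*) + \langle \mathbf{s}, \mathbf{x} - \mathbf{x}^* \rangle + \sum_i a_i\big(d(\mathbf{x}) - d(\mathbf{x}^*) - \langle \nabla d(\mathbf{x}_i), \mathbf{x} - \mathbf{x}^* \rangle\big) \\
&= q(\mathbf{x}^*) - \Big\langle \sum_i a_i\big(\nabla d(\mathbf{x}^*) - \nabla d(\mathbf{x}_i)\big),\, \mathbf{x} - \mathbf{x}^* \Big\rangle + \sum_i a_i\big(d(\mathbf{x}) - d(\mathbf{x}^*) - \langle \nabla d(\mathbf{x}_i), \mathbf{x} - \mathbf{x}^* \rangle\big) \\
&= q(\mathbf{x}^*) + \Big(\sum_i a_i\Big)\Delta_d(\mathbf{x}, \mathbf{x}^*). \qquad\blacksquare
\end{aligned}$$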

Assumption 1. In the objective (10), we will assume that $f$ is $\lambda_1$-sc and $\Psi$ is $\lambda_2$-sc wrt a given $d$ ($\lambda_1, \lambda_2 \ge 0$). Then $f + \Psi$ is $\lambda$-sc, where

$$\lambda := \lambda_1 + \lambda_2.$$

2.2. Assumption on the ground optimization problem

We assume it is possible to efficiently solve the following ground problem:

Assumption 2. Given an arbitrary linear function $\langle \mathbf{u}, \mathbf{x} \rangle$, $a_i \ge 0$ and $\mathbf{x}_i \in \mathrm{dom}\,\Psi$ ($i \in [k] := \{1, \ldots, k\}$), assume the following optimization problem can be solved efficiently:

$$\min_{\mathbf{x}}\ \langle \mathbf{u}, \mathbf{x} \rangle + b + \sum_{i=1}^{k} a_i\,\Delta_d(\mathbf{x}, \mathbf{x}_i) + \Psi(\mathbf{x}). \tag{12}$$

For different $k$, we call the assumption BD-$k$.

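In [18] and [19], the 1-memory AGM-EF for general convex objectives assumes BD-1. In [6] and [17], BD-∞ is assumed in the sense that for arbitrary $k < \infty$, (12) is assumed to be efficiently solvable. In our later 1-memory AGM-EF, we will assume BD-2 if $\lambda_1 > 0$. Although most of the literature assumes BD-1, it is actually not hard to see that the extension to BD-2 does not cause any real difficulty. In fact, even BD-∞ is feasible as long as $\sum_i a_i \nabla d(\mathbf{x}_i)$ can be aggregated efficiently (which is often true).

As a direct consequence of BD-1, now that the $f$ in (10) is $\lambda_1$-sc and $L$-l.c.g, $J(\mathbf{x})$ can be solved in one step if $L = \sigma\lambda_1$. To see this, by definition for all $\mathbf{x}$

$$f(\mathbf{x}) \le f(\mathbf{x}_0) + \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + \frac{L}{2}\|\mathbf{x} - \mathbf{x}_0\|^2,$$

and

$$\begin{aligned}
f(\mathbf{x}) &\ge f(\mathbf{x}_0) + \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + \lambda_1\Delta_d(\mathbf{x}, \mathbf{x}_0) \\
&\ge f(\mathbf{x}_0) + \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + \frac{\sigma\lambda_1}{2}\|\mathbf{x} - \mathbf{x}_0\|^2.
\end{aligned}$$

So clearly $L \ge \sigma\lambda_1$. If $L = \sigma\lambda_1$, then

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + \lambda_1\Delta_d(\mathbf{x}, \mathbf{x}_0).$$

Hence, $f(\mathbf{x}) + \Psi(\mathbf{x})$ exactly satisfies the precondition of BD-1. Therefore, in the sequel we will assume

$$L > \sigma\lambda_1.$$

$c := \frac{L}{\sigma\lambda_1}$ can be viewed as the condition number.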

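BD-2 allows us to inductively apply Property 6 to simplify the expression of the following function

$$q_n(\mathbf{x}) := a_0\,\Delta_d(\mathbf{x}, \mathbf{x}_0) + \sum_{i=1}^{n}\big[b_i + \langle \mathbf{u}_i, \mathbf{x} \rangle + a_i\,\Delta_d(\mathbf{x}, \mathbf{x}_i)\big]$$

into

$$q_n(\mathbf{x}) = \Big(\sum_{i=0}^{n} a_i\Big)\Delta_d(\mathbf{x}, \mathbf{x}_n^*) + q_n(\mathbf{x}_n^*),$$

where $\mathbf{x}_n^* = \mathrm{argmin}_{\mathbf{x}}\, q_n(\mathbf{x})$. Let $q_0(\mathbf{x}) = a_0\,\Delta_d(\mathbf{x}, \mathbf{x}_0)$. Then simplify $q_1(\mathbf{x})$ into the sum of a constant and a Bregman divergence by Property 6:

$$q_1(\mathbf{x}) = (a_0 + a_1)\,\Delta_d(\mathbf{x}, \mathbf{x}_1^*) + q_1(\mathbf{x}_1^*), \tag{13}$$

$$\mathbf{x}_1^* = \mathrm{argmin}_{\mathbf{x}}\ q_0(\mathbf{x}) + b_1 + \langle \mathbf{u}_1, \mathbf{x} \rangle + a_1\,\Delta_d(\mathbf{x}, \mathbf{x}_1),$$

since $\mathbf{x}_1^*$ can be computed efficiently according to assumption BD-2. Next, $q_2(\mathbf{x})$ can be simplified by using (13) and Property 6 again:

$$q_2(\mathbf{x}) = (a_0 + a_1 + a_2)\,\Delta_d(\mathbf{x}, \mathbf{x}_2^*) + q_2(\mathbf{x}_2^*),$$

$$\mathbf{x}_2^* = \mathrm{argmin}_{\mathbf{x}}\ q_1(\mathbf{x}) + b_2 + \langle \mathbf{u}_2, \mathbf{x} \rangle + a_2\,\Delta_d(\mathbf{x}, \mathbf{x}_2).$$

This incremental scheme is especially useful when the argmin of all $q_k(\mathbf{x})$ is readily available [e.g. 23, Section 5].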



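Notations. Lower case bold letters (e.g., $\mathbf{x}$, $\boldsymbol{\alpha}$) denote vectors, $x_i$ denotes the $i$-th component of $\mathbf{x}$, $\mathbf{0}$ refers to the vector with all zero components, $\mathbf{e}_i$ is the $i$-th coordinate vector (all 0's except 1 at the $i$-th coordinate) and $S_n$ refers to the $n$ dimensional simplex $\{\mathbf{x} \in [0, 1]^n : \sum_{i=1}^{n} x_i = 1\}$. Unless specified otherwise, $\langle \cdot, \cdot \rangle$ denotes the Euclidean dot product $\langle \mathbf{x}, \mathbf{w} \rangle = \sum_i x_i w_i$. We denote $\bar{\mathbb{R}} := \mathbb{R} \cup \{\infty\}$, and $[t] := \{1, \ldots, t\}$. From now on, we will always fix the $d$ in the context and omit the subscript $d$ in $\Delta_d$.

We follow the definition of norms in [6], which we recap here. Suppose a finite dimensional real vector space $E$ (e.g. $\mathbb{R}^p$) is endowed with a norm $\|\cdot\|$. The space of linear functions on $E$ is called the dual space, which we denote as $E^*$. The norm of $E^*$ is defined as

$$\|\mathbf{s}\|_* := \max_{\mathbf{x} \in E:\ \|\mathbf{x}\| = 1} \langle \mathbf{s}, \mathbf{x} \rangle.$$

Suppose $A$ is a linear operator from $E_1$ to $E_2^*$, and $E_i$ has norm $\|\cdot\|_i$ for $i = 1, 2$. Then the norm of $A$ is defined as

$$\|A\| := \max_{\mathbf{x} \in E_1,\ \boldsymbol{\alpha} \in E_2:\ \|\mathbf{x}\|_1 = \|\boldsymbol{\alpha}\|_2 = 1} \langle A\mathbf{x}, \boldsymbol{\alpha} \rangle. \tag{14}$$

If we define an adjoint operator $A^* : E_2 \to E_1^*$ as

$$\langle A^*\boldsymbol{\alpha}, \mathbf{x} \rangle := \langle A\mathbf{x}, \boldsymbol{\alpha} \rangle, \qquad \forall\, \mathbf{x} \in E_1,\ \boldsymbol{\alpha} \in E_2,$$

then it can be shown that

$$\|A^*\| = \max_{\mathbf{x} \in E_1,\ \boldsymbol{\alpha} \in E_2:\ \|\mathbf{x}\|_1 = \|\boldsymbol{\alpha}\|_2 = 1} \langle A^*\boldsymbol{\alpha}, \mathbf{x} \rangle = \max_{\mathbf{x} \in E_1,\ \boldsymbol{\alpha} \in E_2:\ \|\mathbf{x}\|_1 = \|\boldsymbol{\alpha}\|_2 = 1} \langle A\mathbf{x}, \boldsymbol{\alpha} \rangle = \|A\|.$$

The definition of matrix norm in (14) implies that

$$\|A\mathbf{x}\|_* \le \|A\|\,\|\mathbf{x}\| \quad \forall\, \mathbf{x} \in E_1, \qquad \|A^*\boldsymbol{\alpha}\|_* \le \|A^*\|\,\|\boldsymbol{\alpha}\| \quad \forall\, \boldsymbol{\alpha} \in E_2.$$

To simplify notation we denote

$$\ell_f(\mathbf{x}; \mathbf{y}, \lambda_1) := f(\mathbf{y}) + \langle \nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \lambda_1\Delta(\mathbf{x}, \mathbf{y}).$$

If $f$ is $\lambda_1$-sc, then $\ell_f(\mathbf{x}; \mathbf{y}, \lambda_1) \le f(\mathbf{x})$ for all $\mathbf{y}$ and $\mathbf{x}$.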

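3. ∞-memory AGM-EF

The ∞-memory version of AGM-EF refers to the class of algorithms which use in each iteration all the past gradients $\nabla f(\mathbf{u}_1), \ldots, \nabla f(\mathbf{u}_k)$. We present the method in Algorithm 1.

Algorithm 1 ∞-memory AGM-EF (AGM-EF-∞).

 1: Arbitrarily initialize $\mathbf{x}_0 \in \mathrm{dom}\,\Psi$. Set $\mathbf{z}_0 \leftarrow \mathbf{x}_0$.
 2: Set $A_0 \leftarrow 0$.
 3: $\psi_0(\mathbf{x}) \leftarrow \Delta(\mathbf{x}, \mathbf{x}_0)$.
 4: for $k = 0, 1, \ldots$ do
 5:   Denote as $a_{k+1}$ the positive root (in $a$) of
        $(a + A_k)(\lambda_1 a + \lambda A_k + 1) + a\lambda_2 A_k = L\sigma^{-1}a^2$.
 6:   $A_{k+1} \leftarrow A_k + a_{k+1}$,
      $\tau_1 \leftarrow 1 + \lambda A_k$, $\tau_2 \leftarrow \lambda_1 a_{k+1}$, $\tau_3 \leftarrow \lambda_2 a_{k+1}\frac{A_k}{A_{k+1}}$, $\tau \leftarrow \tau_1 + \tau_2 + \tau_3$.
 7:   $\mathbf{u}_{k+1} \leftarrow \dfrac{a_{k+1}\tau_1\mathbf{z}_k + (\tau A_k + \tau_3 a_{k+1})\mathbf{x}_k}{\tau A_{k+1} - \tau_2 a_{k+1}}$.⁴
 8:   $\psi_{k+1}(\mathbf{x}) \leftarrow \psi_k(\mathbf{x}) + a_{k+1}\big[\Psi(\mathbf{x}) + \ell_f(\mathbf{x}; \mathbf{u}_{k+1}, \lambda_1)\big]$.
 9:   Find $\mathbf{z}_{k+1} \leftarrow \mathrm{argmin}_{\mathbf{x}}\, \psi_{k+1}(\mathbf{x})$.
10:   $\mathbf{x}_{k+1} \leftarrow (A_k\mathbf{x}_k + a_{k+1}\mathbf{z}_{k+1})/A_{k+1}$.
11: end for

4 One can verify by simple algebra that $\mathbf{u}_{k+1}$ is a convex combination of $\mathbf{z}_k$ and $\mathbf{x}_k$.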

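The main idea of the algorithm is to approximate $J(\mathbf{x})$ by a sequence of functions $\psi_k$ that are constructed in Step 8 of Algorithm 1, and then ensure the following relationship at all iterations ($k \ge 0$):

$$A_k J(\mathbf{x}_k) \le \min_{\mathbf{x}} \psi_k(\mathbf{x}). \tag{15}$$

By construction, for all $k \ge 0$

$$\psi_k(\mathbf{x}) = \Delta(\mathbf{x}, \mathbf{x}_0) + \sum_{i=1}^{k} a_i\big[\Psi(\mathbf{x}) + \ell_f(\mathbf{x}; \mathbf{u}_i, \lambda_1)\big]. \tag{16}$$

Summation from 1 to 0 is assumed to be 0. Now it is not hard to see that relationship (15) implies rates of convergence:

Lemma 3. If (15) holds for all $k \ge 1$, then for any $\mathbf{x} \in \mathrm{dom}\,\Psi$, we have

$$J(\mathbf{x}_k) - J(\mathbf{x}) \le A_k^{-1}\Delta(\mathbf{x}, \mathbf{x}_0). \tag{17}$$

Proof. By (16), we have for all $k \ge 1$

$$\begin{aligned}
\psi_k(\mathbf{x}) &= \Delta(\mathbf{x}, \mathbf{x}_0) + \sum_{i=1}^{k} a_i\big[\Psi(\mathbf{x}) + \ell_f(\mathbf{x}; \mathbf{u}_i, \lambda_1)\big] \\
&\le \Delta(\mathbf{x}, \mathbf{x}_0) + \sum_{i=1}^{k} a_i\big[\Psi(\mathbf{x}) + f(\mathbf{x})\big] \\
&= \Delta(\mathbf{x}, \mathbf{x}_0) + A_k J(\mathbf{x}).
\end{aligned}$$

Combining with (15), we get (17). ■

Therefore, the rate of convergence depends entirely on how fast $A_k$ grows. We will show that Algorithm 1 yields $A_k \sim k^2$ if $\lambda = 0$, or $A_k \sim e^k$ if $\lambda > 0$. All updates are also kept efficient. We next prove (15) and lower bound the growth rate of $A_k$.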



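Lemma 4 (Eq (15)). The sequence $\{\mathbf{x}_k\}$ generated by Algorithm 1 satisfies for all $k \ge 0$

$$A_k J(\mathbf{x}_k) \le \min_{\mathbf{x}} \psi_k(\mathbf{x}).$$

Proof. We prove by induction. First check that both sides are 0 for $k = 0$. Now suppose (15) holds for some step $k \ge 0$. By (16) and Property 2, $\psi_k$ must be $(\lambda A_k + 1)$-sc wrt $d$. So by Property 4 and the fact that $\mathbf{z}_k$ minimizes $\psi_k$, we have

$$\begin{aligned}
\psi_k(\mathbf{z}_{k+1}) &\ge \psi_k(\mathbf{z}_k) + (\lambda A_k + 1)\Delta(\mathbf{z}_{k+1}, \mathbf{z}_k) \\
&\ge A_k J(\mathbf{x}_k) + (\lambda A_k + 1)\Delta(\mathbf{z}_{k+1}, \mathbf{z}_k),
\end{aligned} \tag{18}$$

where the second inequality is by the induction assumption. So

$$\begin{aligned}
\min_{\mathbf{x}} \psi_{k+1}(\mathbf{x}) &= \psi_{k+1}(\mathbf{z}_{k+1}) \\
&= \psi_k(\mathbf{z}_{k+1}) + a_{k+1}\ell_f(\mathbf{z}_{k+1}; \mathbf{u}_{k+1}, \lambda_1) + a_{k+1}\Psi(\mathbf{z}_{k+1}) \\
&\overset{(a)}{\ge} A_k f(\mathbf{x}_k) + A_k\Psi(\mathbf{x}_k) + (1 + \lambda A_k)\Delta(\mathbf{z}_{k+1}, \mathbf{z}_k) + a_{k+1}\ell_f(\mathbf{z}_{k+1}; \mathbf{u}_{k+1}, \lambda_1) + a_{k+1}\Psi(\mathbf{z}_{k+1}) \\
&\overset{(b)}{\ge} A_k\big[f(\mathbf{u}_{k+1}) + \langle \nabla f(\mathbf{u}_{k+1}), \mathbf{x}_k - \mathbf{u}_{k+1} \rangle\big] + (1 + \lambda A_k)\Delta(\mathbf{z}_{k+1}, \mathbf{z}_k) + A_k\Psi(\mathbf{x}_k) + a_{k+1}\Psi(\mathbf{z}_{k+1}) \\
&\qquad + a_{k+1}\big[f(\mathbf{u}_{k+1}) + \langle \nabla f(\mathbf{u}_{k+1}), \mathbf{z}_{k+1} - \mathbf{u}_{k+1} \rangle + \lambda_1\Delta(\mathbf{z}_{k+1}, \mathbf{u}_{k+1})\big] \\
&= A_{k+1} f(\mathbf{u}_{k+1}) + A_k\Psi(\mathbf{x}_k) + a_{k+1}\Psi(\mathbf{z}_{k+1}) + \langle \nabla f(\mathbf{u}_{k+1}), A_k\mathbf{x}_k - A_{k+1}\mathbf{u}_{k+1} + a_{k+1}\mathbf{z}_{k+1} \rangle \\
&\qquad + \tau_1\Delta(\mathbf{z}_{k+1}, \mathbf{z}_k) + \tau_2\Delta(\mathbf{z}_{k+1}, \mathbf{u}_{k+1}) \\
&\overset{(c)}{\ge} A_{k+1} f(\mathbf{u}_{k+1}) + A_{k+1}\Psi(\mathbf{x}_{k+1}) + \frac{\sigma}{2}\tau_3\|\mathbf{z}_{k+1} - \mathbf{x}_k\|^2 + \langle \nabla f(\mathbf{u}_{k+1}), A_k\mathbf{x}_k - A_{k+1}\mathbf{u}_{k+1} + a_{k+1}\mathbf{z}_{k+1} \rangle \\
&\qquad + \frac{\sigma}{2}\tau_1\|\mathbf{z}_{k+1} - \mathbf{z}_k\|^2 + \frac{\sigma}{2}\tau_2\|\mathbf{z}_{k+1} - \mathbf{u}_{k+1}\|^2 \\
&\overset{(d)}{\ge} A_{k+1}\Psi(\mathbf{x}_{k+1}) + A_{k+1}\Big[f(\mathbf{u}_{k+1}) + \frac{a_{k+1}}{A_{k+1}}\Big\langle \nabla f(\mathbf{u}_{k+1}), \mathbf{z}_{k+1} - \frac{A_{k+1}\mathbf{u}_{k+1} - A_k\mathbf{x}_k}{a_{k+1}} \Big\rangle \\
&\qquad + \frac{\sigma}{2}\,\frac{\tau_1 + \tau_2 + \tau_3}{A_{k+1}}\Big\|\mathbf{z}_{k+1} - \frac{\tau_1\mathbf{z}_k + \tau_2\mathbf{u}_{k+1} + \tau_3\mathbf{x}_k}{\tau_1 + \tau_2 + \tau_3}\Big\|^2\Big] \\
&\overset{(e)}{=} A_{k+1}\Psi(\mathbf{x}_{k+1}) + A_{k+1}\Big[f(\mathbf{u}_{k+1}) + \frac{a_{k+1}}{A_{k+1}}\Big\langle \nabla f(\mathbf{u}_{k+1}), \mathbf{z}_{k+1} - \frac{A_{k+1}\mathbf{u}_{k+1} - A_k\mathbf{x}_k}{a_{k+1}} \Big\rangle \\
&\qquad + \frac{L}{2}\Big(\frac{a_{k+1}}{A_{k+1}}\Big)^2\Big\|\mathbf{z}_{k+1} - \frac{\tau_1\mathbf{z}_k + \tau_2\mathbf{u}_{k+1} + \tau_3\mathbf{x}_k}{\tau_1 + \tau_2 + \tau_3}\Big\|^2\Big] \\
&\overset{(f)}{=} A_{k+1}\Psi(\mathbf{x}_{k+1}) + A_{k+1}\Big[f(\mathbf{u}_{k+1}) + \langle \nabla f(\mathbf{u}_{k+1}), \mathbf{x}_{k+1} - \mathbf{u}_{k+1} \rangle + \frac{L}{2}\|\mathbf{x}_{k+1} - \mathbf{u}_{k+1}\|^2\Big] \\
&\overset{(g)}{\ge} A_{k+1}\Psi(\mathbf{x}_{k+1}) + A_{k+1} f(\mathbf{x}_{k+1}) = A_{k+1} J(\mathbf{x}_{k+1}).
\end{aligned}$$

Here, step (a) is by (18). (b) is by the convexity of $f$ (at $\mathbf{u}_{k+1}$). (c) is by the $\lambda_2$-sc of $\Psi$ and Property 1. (d) is by the convexity and linearity of $\|\cdot\|$. (e) is by the rule of choosing $a_{k+1}$ in Step 5 of Algorithm 1. (f) is by the definition of $\mathbf{x}_{k+1}$ and $\mathbf{u}_{k+1}$. (g) is by the $L$-l.c.g of $f$. ■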

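Next, we can lower bound the growth rate of $A_k$.

Lemma 5. Let $k \ge 1$. Then

$$A_k \ge \max\left\{\frac{\sigma(k+1)^2}{4L},\ \frac{\sigma}{L - \sigma\lambda_1}\left(1 + \sqrt{\frac{\lambda\sigma}{4L}}\right)^{2k-2}\right\}.$$

Proof. Since $A_0 = 0$, by solving Step 5 in Algorithm 1 we get $A_1 = \frac{\sigma}{L - \sigma\lambda_1}$. Hence the lemma clearly holds for $k = 1$. For all $k \ge 1$, denote

$$\begin{aligned}
M &= (a_{k+1} + A_k)(\lambda_1 a_{k+1} + \lambda A_k + 1) + a_{k+1}\lambda_2 A_k \\
&= A_{k+1} + \lambda A_k A_{k+1} + \lambda_1 a_{k+1} A_{k+1} + \lambda_2 a_{k+1} A_k.
\end{aligned}$$

By the choice of $a_{k+1}$ in Step 5 of Algorithm 1, we get

$$\begin{aligned}
A_{k+1} \le M &= \frac{L}{\sigma}(A_{k+1} - A_k)^2 \\
&= \frac{L}{\sigma}\big(\sqrt{A_{k+1}} + \sqrt{A_k}\big)^2\big(\sqrt{A_{k+1}} - \sqrt{A_k}\big)^2 \\
&\le \frac{4L}{\sigma}A_{k+1}\big(\sqrt{A_{k+1}} - \sqrt{A_k}\big)^2.
\end{aligned} \tag{19}$$

So when $\lambda = 0$ we have

$$A_k \ge \left(\frac{k-1}{2}\sqrt{\frac{\sigma}{L}} + \sqrt{A_1}\right)^2 \ge \frac{\sigma}{4L}(k+1)^2.$$

When $\lambda > 0$, we have

$$\lambda A_k A_{k+1} \le M \le \frac{4L}{\sigma}A_{k+1}\big(\sqrt{A_{k+1}} - \sqrt{A_k}\big)^2,$$

where the last step is by (19). So

$$\sqrt{A_{k+1}} \ge \left(1 + \sqrt{\frac{\lambda\sigma}{4L}}\right)\sqrt{A_k},$$

which directly implies the second term in the max. ■

Combining Lemmas 3, 4 and 5, we derive

Theorem 6. For all $k \ge 1$ and $\mathbf{x} \in \mathrm{dom}\,\Psi$,

$$J(\mathbf{x}_k) - J(\mathbf{x}) \le \Delta(\mathbf{x}, \mathbf{x}_0)\,\min\left\{\frac{4L}{\sigma(k+1)^2},\ \frac{L - \sigma\lambda_1}{\sigma}\left(1 + \sqrt{\frac{\lambda\sigma}{4L}}\right)^{-2k+2}\right\}.$$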

Therefore, as long as one of $\lambda_1$ and $\lambda_2$ is strictly positive such that $\lambda = \lambda_1 + \lambda_2 > 0$, $J(\mathbf{x}_k)$ converges linearly. When $\lambda_1 = 0$ and $\lambda_2 > 0$, $\psi_k$ contains only one Bregman divergence, making it easier to optimize.
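The growth of $A_k$ can be explored numerically. The sketch below iterates only the scalar recursion of Step 5 and compares it with the two lower bounds $\frac{\sigma(k+1)^2}{4L}$ and $\frac{\sigma}{L-\sigma\lambda_1}\big(1+\sqrt{\lambda\sigma/(4L)}\big)^{2k-2}$, which is how we read Lemma 5. The function names and parameter values are illustrative assumptions.

```python
import math

def growth_sequence(L, lam1, lam2, sigma, iters):
    """Return [A_1, ..., A_iters] produced by Step 5 of Algorithm 1."""
    lam = lam1 + lam2
    A, seq = 0.0, []
    for _ in range(iters):
        # Positive root of qa*a^2 - qb*a - qc = 0 (Step 5 rearranged).
        qa = L / sigma - lam1
        qb = 2.0 * lam * A + 1.0
        qc = A * (lam * A + 1.0)
        a = (qb + math.sqrt(qb * qb + 4.0 * qa * qc)) / (2.0 * qa)
        A += a
        seq.append(A)
    return seq

def lower_bound(k, L, lam1, lam2, sigma):
    """Max of the polynomial and geometric lower bounds on A_k."""
    lam = lam1 + lam2
    poly = sigma * (k + 1) ** 2 / (4.0 * L)
    rate = 1.0 + math.sqrt(lam * sigma / (4.0 * L))
    geom = sigma / (L - sigma * lam1) * rate ** (2 * k - 2)
    return max(poly, geom)

seq_convex = growth_sequence(L=1.0, lam1=0.0, lam2=0.0, sigma=1.0, iters=50)
seq_strong = growth_sequence(L=10.0, lam1=1.0, lam2=0.1, sigma=1.0, iters=50)
```

With $\lambda = 0$ the sequence grows quadratically, while any $\lambda > 0$ eventually makes the geometric bound dominate, matching the sublinear and linear regimes of Theorem 6.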

Remark 1. If (18) is replaced by

$$\begin{aligned}
\psi_k(\mathbf{z}_{k+1}) &\ge \psi_k(\mathbf{z}_k) + (\lambda A_k + 1)\Delta(\mathbf{z}_{k+1}, \mathbf{z}_k) \\
&\ge A_k J(\mathbf{x}_k) + (\lambda A_k + 1)\frac{\sigma}{2}\|\mathbf{z}_{k+1} - \mathbf{z}_k\|^2,
\end{aligned}$$

then it is not hard to see that the proof of Lemma 4 still goes through. So $\Psi$ does not need to be $\lambda_2$-sc wrt $d$, and it suffices for it to be $\lambda_2\sigma$ strongly convex wrt $\|\cdot\|$. In practice, checking and satisfying the latter condition can be much easier. A similar remark can be made later for AGM-EF-1, and for ease of exposition we will still assume $\Psi$ is $\lambda_2$-sc wrt $d$.

3.1. Notes on the Computations 

The whole algorithm relies on solving for $\mathbf{z}_k$ efficiently, and this can be dealt with in two ways. First, by (16), minimizing $\psi_k(\mathbf{x})$ only requires solving a problem of the following form:

$$\min_{\mathbf{x}}\ A_k\Psi(\mathbf{x}) + \Delta(\mathbf{x}, \mathbf{u}_0) + \lambda_1\sum_{i=1}^{k} a_i\,\Delta(\mathbf{x}, \mathbf{u}_i) + \Big\langle \sum_{i=1}^{k} a_i\nabla f(\mathbf{u}_i),\, \mathbf{x} \Big\rangle.$$

This is feasible by Assumption 2, and in practice the gradients of $f$ and $d$ can be aggregated on the fly.

The second method requires making one more assumption, in addition to the usual assumption $\mathrm{dom}\,\Psi \subseteq \mathrm{dom}\, d$.

Assumption 7. $\mathrm{dom}\, d \subseteq \mathrm{dom}\,\Psi$.

This assumption is often met when $d$ is the entropy and $\mathrm{dom}\,\Psi$ is the simplex. It ensures that $\mathbf{z}_k := \mathrm{argmin}_{\mathbf{x} \in \mathrm{dom}\,\Psi}\, \psi_k(\mathbf{x})$ is also a solution of the unconstrained optimization $\min_{\mathbf{x} \in \mathrm{dom}\, d} \psi_k(\mathbf{x})$. Then when $\Psi$ is affine on its domain, we can apply Property 6 and the subsequent discussion on inductively updating $\psi_k(\mathbf{x})$. This scheme is particularly useful in Algorithm 1 because the minimizer $\mathbf{z}_k$ is already available.

Even if Assumption 7 does not hold and $\mathbf{z}_k$ is not an unconstrained minimizer of $\psi_k(\mathbf{x})$, one can still spend extra computation to find the unconstrained minimizer and inductively update $\psi_k(\mathbf{x})$. This idea will be useful if the gradient aggregation in the first method is not viable.

3.2. Adaptively tuning the Lipschitz constant

Algorithm 1 requires the explicit value of $L$. This is usually not available, or the global maximum curvature is much larger than the local directional curvature. As a result, the step size $1/L$ becomes too conservative. From the proof of Lemma 4, it is clear that $L$ is used only to ensure (15). So we can probe smaller values of $L$. The modified algorithm is given in Algorithm 2.

Algorithm 2 AGM-EF-∞ with adaptive $L$.

Require: Down scaling factor $\gamma_d$ and up scaling factor $\gamma_u$ ($\gamma_d, \gamma_u > 1$). An optimistic estimate $\tilde{L} \le L$.
 1: Arbitrarily initialize $\mathbf{x}_0 \in \mathrm{dom}\,\Psi$. Set $\mathbf{z}_0 \leftarrow \mathbf{x}_0$.
 2: Set $A_0 \leftarrow 0$.
 3: $\psi_0(\mathbf{x}) \leftarrow \Delta(\mathbf{x}, \mathbf{x}_0)$.
 4: $L_0 \leftarrow \tilde{L} \cdot \gamma_d \cdot \gamma_u$.
 5: for $k = 0, 1, \ldots$ do
 6:   $L_{k+1} \leftarrow L_k/(\gamma_d \cdot \gamma_u)$.
 7:   repeat
 8:     $L_{k+1} \leftarrow L_{k+1} \cdot \gamma_u$.
 9:     Assign to $a_{k+1}$ the positive root (in $a$) of
          $(a + A_k)(\lambda_1 a + \lambda A_k + 1) + a\lambda_2 A_k = L_{k+1}\sigma^{-1}a^2$.
10:     Do steps 6 to 10 of Algorithm 1.
11:   until $A_{k+1}J(\mathbf{x}_{k+1}) \le \psi_{k+1}(\mathbf{z}_{k+1})$.
12: end for

The inner "repeat" loop must terminate in a finite number of steps because $L_k$ grows exponentially and once $L_k \ge L$ the "until" condition must be satisfied. The number of steps in this inner loop is logarithmic in $L$, with the final $L_k \le \gamma_u L$. Moreover, this $L_k$ is decayed by a factor of $\gamma_d$ before being used to initialize $L_{k+1}$. This is in sharp contrast to AGM-PR, where the estimates of $L$ must grow monotonically through iterations. Let us formally characterize how adaptively tuning $L$ leads to faster convergence rates through a faster growth rate of $A_k$.

Lemma 8. For all k > 1, 



Ak > max 




Proof. Simply replace the $L$ in (19) by $L_{k+1}$. ■

In practice, we observed that $L_k$ is often only 10 per cent of the real $L$, and therefore by Lemma 8 the convergence rate is 10 times faster than using $L$. Moreover, the $L_k$ in successive iterations are quite close, so the inner loop terminates in only 2-3 steps.

This adaptive scheme relies on the fact that the key relationship (15) is independent of $L$ and involves function values only at two points (rather than globally). In contrast, the algorithm and analysis in [26] keep a global relationship which explicitly involves $L$, making it hard to accommodate adaptive $L$.

We also tried to adaptively tune $\lambda$, but without success. This turns out to be very hard because the proof uses $\lambda$ as a global property (recall the fact that $\psi_k$ must be $(\lambda A_k + 1)$-sc wrt $d$), while $L$ is used only at $\mathbf{u}_{k+1}$ and $\mathbf{x}_{k+1}$ in Step (g) of the proof of Lemma 4.

3.3. Bounding the Duality Gap 

Algorithm 1 does not have a termination criterion, and a natural criterion would be based on the duality gap. Furthermore, in some applications like (1) the primal problem is nonsmooth and AGM-EF-∞ is applied only to its dual problem, which is $L$-l.c.g. So it is necessary to convert the dual iterates at each step into the primal, and characterize the convergence rate in the primal. In this subsection, we extend the technique in [2, Section 2] to the case of composite objectives. Except for the strong convexity, our whole setting and procedure bear much resemblance to [9], [2, Theorem 2.2], [6, Theorem 3] and [17, Section 6]. We are unaware of any existing result which shows linear convergence of the duality gap as we will describe below.

Consider a minimax problem

$$\min_{\mathbf{x}} \max_{\boldsymbol{\alpha} \in Q_2}\ \phi(\mathbf{x}, \boldsymbol{\alpha}) + \Psi(\mathbf{x}).$$

Here $\Psi : \mathbb{R}^p \to \bar{\mathbb{R}}$ is proper, lower semicontinuous and $\lambda_2$-sc wrt $d$ ($\lambda_2 \ge 0$). Let $\Psi$ satisfy Assumption 2. $Q_2$ is a compact convex set in the Euclidean space. $\phi : \mathbb{R}^p \times Q_2 \to \mathbb{R}$ is continuous on $\mathrm{dom}\,\Psi \times Q_2$. For all fixed $\boldsymbol{\alpha} \in Q_2$, $\phi(\cdot, \boldsymbol{\alpha})$ is $\lambda_1$-sc wrt $d$ ($\lambda_1 \ge 0$) and is differentiable on an open set containing $\mathrm{dom}\,\Psi$. For all fixed $\mathbf{x} \in \mathrm{dom}\,\Psi$, $\phi(\mathbf{x}, \cdot)$ is strictly concave. Therefore, the $\mathrm{argmax}_{\boldsymbol{\alpha} \in Q_2}\, \phi(\mathbf{x}, \boldsymbol{\alpha})$ is unique and we denote it as $\boldsymbol{\alpha}(\mathbf{x})$.

Let us define

$$f(\mathbf{x}) := \max_{\boldsymbol{\alpha} \in Q_2} \phi(\mathbf{x}, \boldsymbol{\alpha}). \tag{20}$$

Then by Danskin's theorem [27, Theorem B.25], $f$ must be convex and differentiable on $\mathrm{dom}\,\Psi$. We further assume that $f$ is $L$-l.c.g on $\mathrm{dom}\,\Psi$. A key strong convexity property of $f$ is:

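Lemma 9. Given all the above assumptions on $\phi$, $f(\mathbf{x})$ must be $\lambda_1$-sc. However, the converse is not necessarily true, i.e. $f(\mathbf{x})$ being $\lambda_1$-sc does not entail that $\phi(\cdot, \boldsymbol{\alpha})$ is $\lambda_1$-sc for all fixed $\boldsymbol{\alpha} \in Q_2$.

Proof. For any $\mathbf{x}_1, \mathbf{x}_2 \in \mathrm{dom}\,\Psi$, we have

$$\begin{aligned}
f(\mathbf{x}_2) = \max_{\boldsymbol{\alpha}} \phi(\mathbf{x}_2, \boldsymbol{\alpha}) &\ge \phi(\mathbf{x}_2, \boldsymbol{\alpha}(\mathbf{x}_1)) \\
&\ge \phi(\mathbf{x}_1, \boldsymbol{\alpha}(\mathbf{x}_1)) + \langle \nabla_{\mathbf{x}}\phi(\mathbf{x}_1, \boldsymbol{\alpha}(\mathbf{x}_1)), \mathbf{x}_2 - \mathbf{x}_1 \rangle + \lambda_1\Delta(\mathbf{x}_2, \mathbf{x}_1) \\
&= f(\mathbf{x}_1) + \langle \nabla f(\mathbf{x}_1), \mathbf{x}_2 - \mathbf{x}_1 \rangle + \lambda_1\Delta(\mathbf{x}_2, \mathbf{x}_1),
\end{aligned}$$

where the last step is by Danskin's theorem. ■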

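We also define a dual objective

$$J(\mathbf{x}) := \Psi(\mathbf{x}) + \max_{\boldsymbol{\alpha} \in Q_2} \phi(\mathbf{x}, \boldsymbol{\alpha}),$$

$$D(\boldsymbol{\alpha}) := \min_{\mathbf{x}}\{\phi(\mathbf{x}, \boldsymbol{\alpha}) + \Psi(\mathbf{x})\} \qquad \text{for } \boldsymbol{\alpha} \in Q_2, \tag{21}$$

where the argmin in (21) may not be unique and $D(\boldsymbol{\alpha})$ may be nonsmooth. Our assumptions above ensure that for any $\boldsymbol{\alpha} \in Q_2$ and any $\mathbf{x}$, the following is true:

$$D(\boldsymbol{\alpha}) \le J(\mathbf{x}), \qquad \text{and} \qquad \max_{\boldsymbol{\alpha} \in Q_2} D(\boldsymbol{\alpha}) = \min_{\mathbf{x}} J(\mathbf{x}).$$

When applied to minimize $J(\mathbf{x})$, AGM-EF-∞ (with or without adaptive $L$) produces a sequence of $\{\mathbf{x}_k, \mathbf{u}_k, \mathbf{z}_k\}$. It is our goal to design a sequence of dual variables $\{\boldsymbol{\alpha}_k\}$ based on $\{\mathbf{x}_i, \mathbf{u}_i, \mathbf{z}_i : i \le k\}$ such that the duality gap

$$\delta_k := J(\mathbf{x}_k) - D(\boldsymbol{\alpha}_k)$$

goes to 0 fast. Since

$$\delta_k \ge J(\mathbf{x}_k) - \max_{\boldsymbol{\alpha} \in Q_2} D(\boldsymbol{\alpha}) = J(\mathbf{x}_k) - \min_{\mathbf{x}} J(\mathbf{x}),$$

once $\delta_k$ falls below a prescribed tolerance $\epsilon$, $\mathbf{x}_k$ is guaranteed to be an $\epsilon$ accurate solution of $J$. Indeed we will show that the following construction of $\boldsymbol{\alpha}_k$ meets our need:

$$\boldsymbol{\alpha}_k = \frac{1}{A_k}\sum_{i=1}^{k} a_i\,\boldsymbol{\alpha}(\mathbf{u}_i), \tag{22}$$

where $a_i$ and $A_k$ are also from AGM-EF-∞. (22) can be equivalently reformulated into a recursion which allows efficient update of $\boldsymbol{\alpha}_k$:

$$\boldsymbol{\alpha}_1 = \boldsymbol{\alpha}(\mathbf{u}_1), \qquad \text{and} \qquad \boldsymbol{\alpha}_{k+1} = \frac{A_k}{A_{k+1}}\boldsymbol{\alpha}_k + \frac{a_{k+1}}{A_{k+1}}\boldsymbol{\alpha}(\mathbf{u}_{k+1}).$$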
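The recursion for $\boldsymbol{\alpha}_k$ avoids storing all past $\boldsymbol{\alpha}(\mathbf{u}_i)$. A minimal sketch, with the hypothetical helper `update_dual_average` and made-up weights, shows that it reproduces the weighted average in (22):

```python
import numpy as np

def update_dual_average(alpha_k, A_k, a_next, alpha_u_next):
    """One step of alpha_{k+1} = (A_k/A_{k+1}) alpha_k + (a_{k+1}/A_{k+1}) alpha(u_{k+1})."""
    A_next = A_k + a_next
    alpha_next = (A_k / A_next) * alpha_k + (a_next / A_next) * alpha_u_next
    return alpha_next, A_next

# Illustrative data: three dual points alpha(u_i) with weights a_i.
weights = np.array([1.0, 2.0, 3.0])
dual_points = np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]])

alpha_bar, A_run = np.zeros(2), 0.0      # A_0 = 0, so alpha_1 = alpha(u_1)
for a_i, alpha_i in zip(weights, dual_points):
    alpha_bar, A_run = update_dual_average(alpha_bar, A_run, a_i, alpha_i)

direct = weights @ dual_points / weights.sum()   # formula (22)
```

Because each update is a convex combination, the running average stays inside any convex set that contains the individual $\boldsymbol{\alpha}(\mathbf{u}_i)$.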

Theorem 10. Suppose a sequence $\{x_k, u_k, z_k\}$ is produced when AGM-EF-∞ is applied to minimize $J(x)$ by treating $f$ as $\lambda_1$-sc. Then the $\{\alpha_k\}$ defined by (22) satisfies $\alpha_k \in Q_2$ and

$\delta_k = J(x_k) - D(\alpha_k) \le \frac{1}{A_k} \max_{x \in \operatorname{dom}\Psi} \Delta(x, u_0).$    (23)



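Proof. Since $u_i \in \operatorname{dom}\Psi$, $\alpha(u_i) \in Q_2$. And $\alpha_k$ is a convex combination of $\alpha(u_i)$ ($i \le k$), so $\alpha_k \in Q_2$. Using the fact that $\phi(x, \alpha)$ is $\lambda_1$-sc in $x$ for all fixed $\alpha$, we have

$\ell_f(x; u_i, \lambda_1) = f(u_i) + \langle \nabla f(u_i), x - u_i \rangle + \lambda_1 \Delta(x, u_i)$
$= \phi(u_i, \alpha(u_i)) + \langle \nabla_x \phi(u_i, \alpha(u_i)), x - u_i \rangle + \lambda_1 \Delta(x, u_i)$
$\le \phi(x, \alpha(u_i)).$    (24)

Now by using relationship (15) and (16), we have

$A_k J(x_k) \le \min_x \{ \Delta(x, u_0) + A_k \Psi(x) + \sum_{i=1}^{k} a_i \ell_f(x; u_i, \lambda_1) \}$
$\le \min_x \{ \Delta(x, u_0) + A_k \Psi(x) + \sum_{i=1}^{k} a_i \phi(x, \alpha(u_i)) \}$
$\le \min_x \{ \Delta(x, u_0) + A_k \Psi(x) + A_k \phi(x, \frac{1}{A_k} \sum_{i=1}^{k} a_i \alpha(u_i)) \}$
$\le \max_{x \in \operatorname{dom}\Psi} \Delta(x, u_0) + A_k \min_x \{ \Psi(x) + \phi(x, \alpha_k) \}$
$= \max_{x \in \operatorname{dom}\Psi} \Delta(x, u_0) + A_k D(\alpha_k).$ ■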

So $\delta_k$ converges linearly as long as $\lambda_1 + \lambda_2 > 0$. If $\operatorname{dom}\Psi$ is unbounded and $\max_{x \in \operatorname{dom}\Psi} \Delta(x, u_0) = \infty$, then the bound in (23) becomes vacuous.

We emphasize that in Theorem 10, AGM-EF-∞ is invoked by treating $f$ as $\lambda_1$-sc, although the real strong convexity constant $\lambda_1'$ of $f$ may be greater than $\lambda_1$. In this case, the duality gap will decay at a slower rate than that for the gap of $J$ (by using $\lambda_1'$ in AGM-EF-∞). However the strong convexity of $\Psi$ is still fully utilized in the duality gap, and in many machine learning problems the strong convexity does come from $\Psi$ rather than $f$ (i.e. $\lambda_1 = \lambda_1' = 0$).

4. 1-memory AGM-EF 

Note that AGM-EF-∞ keeps a nonparametric form (16) of the model $\psi_k(x)$ whose complexity grows with iteration. In 1-memory AGM-EF, the model is compressed to a simple parametric form in each iteration. Auslender and Teboulle [28] gave a Bregman version for unconstrained optimization. [18] provided an algorithm for constrained problems with Euclidean distance as the prox-function. However, only [26] and [9] accommodate both Bregman divergence and constraints. But their algorithms do not extend to strongly convex objectives and restrict the estimate of $L$ to be nondecreasing through iterations. Therefore, we propose in this section a 1-memory AGM-EF

Algorithm 3 1-memory AGM-EF (AGM-EF-1).
1: Arbitrarily pick $u_0 \in \operatorname{dom}\Psi$.
2: Initialize $c_0 \leftarrow \frac{L}{\sigma} + \lambda_2$.
3: $q_0(x) \leftarrow \frac{L}{\sigma} \Delta(x, u_0) + \Psi(x) + \ell_f(x; u_0, 0)$.
4: $x_0 = z_0 \leftarrow \operatorname{argmin}_x q_0(x)$.
5: for $k = 0, 1, \ldots$ do
6: Assign to $a_{k+1}$ the positive root (in $a$) of $\sigma(1 - a)(c_k + \lambda_2 a) + \sigma \lambda_1 a = L a^2$.
7: $c_{k+1} \leftarrow (1 - a_{k+1}) c_k + (\lambda_1 + \lambda_2) a_{k+1}$.
8: $\tau_1 \leftarrow (1 - a_{k+1}) c_k$, $\tau_2 \leftarrow \lambda_1 a_{k+1}$, $\tau_3 \leftarrow \lambda_2 a_{k+1}(1 - a_{k+1})$, $\tau \leftarrow \tau_1 + \tau_2 + \tau_3$.
9: $u_{k+1} \leftarrow \frac{(\tau - (\tau_1 + \tau_2) a_{k+1}) x_k + \tau_1 a_{k+1} z_k}{\tau - \tau_2 a_{k+1}}$.$^5$
10: $\psi_{k+1}(x) \leftarrow (1 - a_{k+1}) q_k(x) + a_{k+1} [\ell_f(x; u_{k+1}, \lambda_1) + \Psi(x)]$.
11: $z_{k+1} \leftarrow \operatorname{argmin}_x \psi_{k+1}(x)$.
12: $x_{k+1} \leftarrow (1 - a_{k+1}) x_k + a_{k+1} z_{k+1}$.
13: $q_{k+1}(x) \leftarrow c_{k+1} \Delta(x, z_{k+1}) + \psi_{k+1}(z_{k+1})$.
14: end for



which uses Bregman prox-function, and allows con- 
straints and non-monotonic adaptive tuning of L. 

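Arbitrarily pick $u_0 \in \operatorname{dom}\Psi$ and initialize by

$q_0(x) := \frac{L}{\sigma} \Delta(x, u_0) + f(u_0) + \langle \nabla f(u_0), x - u_0 \rangle + \Psi(x)$
$x_0 = z_0 = \operatorname{argmin}_x q_0(x)$
$c_0 = \frac{L}{\sigma} + \lambda_2.$

Then for all $k \ge 0$, define:

$\psi_{k+1}(x) = (1 - a_{k+1}) q_k(x) + a_{k+1} [\ell_f(x; u_{k+1}, \lambda_1) + \Psi(x)]$
$z_{k+1} = \operatorname{argmin}_x \psi_{k+1}(x)$
$c_{k+1} = (1 - a_{k+1}) c_k + \lambda a_{k+1}$
$q_{k+1}(x) = c_{k+1} \Delta(x, z_{k+1}) + \psi_{k+1}(z_{k+1}).$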
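The only nontrivial scalar computation in each iteration of Algorithm 3 is the positive root $a_{k+1}$ of the quadratic in step 6. A minimal sketch follows, reading that equation as $\sigma(1-a)(c_k + \lambda_2 a) + \sigma\lambda_1 a = L a^2$ and taking $\lambda = \lambda_1 + \lambda_2$ in the update of $c_k$; the function names and the numeric constants are illustrative assumptions, not part of the paper.

```python
import math


def next_step(c_k, L, sigma, lam1, lam2):
    """Positive root a of sigma*(1-a)*(c_k + lam2*a) + sigma*lam1*a = L*a**2."""
    # Rearranged: (L + sigma*lam2)*a^2 + sigma*(c_k - lam1 - lam2)*a - sigma*c_k = 0
    qa = L + sigma * lam2
    qb = sigma * (c_k - lam1 - lam2)
    qc = -sigma * c_k
    return (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)


def step_sequence(L, sigma, lam1, lam2, iters):
    """Run only the scalar recursions for a_k and c_k, tracking prod_i (1 - a_i)."""
    c = L / sigma + lam2  # c_0
    prod = 1.0
    history = []
    for _ in range(iters):
        a = next_step(c, L, sigma, lam1, lam2)
        c = (1.0 - a) * c + (lam1 + lam2) * a
        prod *= 1.0 - a
        history.append((a, c, prod))
    return history


# Hypothetical constants, for illustration only.
history = step_sequence(L=10.0, sigma=1.0, lam1=0.0, lam2=0.1, iters=50)
```

The third entry of each tuple is the running product $\prod_i (1 - a_i)$, which is the quantity that controls the convergence rate in the analysis that follows.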



actually solves a constrained optimization, and then 
(11) must be changed to > which breaks Property 13. 

The proof of rate of convergence for Algorithm 3 relies on the following two relations: for all $k \ge 0$ and $x \in \operatorname{dom}\Psi$,

$q_{k+1}(x) - J(x) \le (1 - a_{k+1})(q_k(x) - J(x))$    (26)
$J(x_k) \le q_k(z_k).$    (27)

From the three inequalities (25), (26) and (27), we get for all $x \in \operatorname{dom}\Psi$,

$J(x_k) \overset{(27)}{\le} q_k(z_k) \overset{(25)}{\le} q_k(x) \overset{(26)}{\le} J(x) + (q_0(x) - J(x)) \prod_{i=1}^{k} (1 - a_i).$    (28)

So the gap $J(x_k) - J(x)$ decays at the same rate as $\prod_{i=1}^{k} (1 - a_i)$.$^6$ Compared with the ∞-memory AGM-EF, the additional inequality (26) is now needed because the models $q_k$ here are approximations of the $\psi_k$ in (16). Next, we prove the three relations one by one.

Lemma 11 (Eq (26)). For all $k \ge 0$ and $x$, we have $q_{k+1}(x) - J(x) \le (1 - a_{k+1})(q_k(x) - J(x))$.

Proof. Since $z_{k+1}$ minimizes $\psi_{k+1}(x)$ and $\psi_{k+1}(x)$ is $c_{k+1}$-sc, by Property 5 we have

$\psi_{k+1}(x) \ge \psi_{k+1}(z_{k+1}) + c_{k+1} \Delta(x, z_{k+1}).$    (29)

So for all $x \in Q$,

$(1 - a_{k+1}) q_k(x) + a_{k+1} J(x)$
$\ge (1 - a_{k+1}) q_k(x) + a_{k+1} [\ell_f(x; u_{k+1}, \lambda_1) + \Psi(x)]$    (30)
$= \psi_{k+1}(x)$
$\ge \psi_{k+1}(z_{k+1}) + c_{k+1} \Delta(x, z_{k+1})$    (by (29))
$= q_{k+1}(x).$    (31)



By construction for all $k \ge 0$, $q_k$ is $c_k$-sc and $\psi_{k+1}$ is strongly convex with constant $(1 - a_{k+1}) c_k + \lambda a_{k+1}$, i.e. $c_{k+1}$-sc. Clearly, for all $k \ge 1$

$\min_x \psi_k(x) = \psi_k(z_k) = q_k(z_k) = \min_x q_k(x).$    (25)

But except at $x = z_k$, $q_k(x) \ne \psi_k(x)$ in general. The only case where $q_k(x) = \psi_k(x)$ is when $\Psi(x)$ is an affine function on $\operatorname{dom} d$ and $\operatorname{dom} d \subseteq \operatorname{dom}\Psi$. Then an inductive application of Property 6 reveals $q_k(x) = \psi_k(x)$. Lemma 5.2 of [23] is exactly this case with $\Psi(x) = 0$. However, when $\operatorname{dom} d \not\subseteq \operatorname{dom}\Psi$ then $z_{k+1}$



Lemma 12 (Eq (27)). For all $k \ge 0$, $J(x_k) \le q_k(z_k)$.

Proof. We prove by induction. First, when $k = 0$, $q_0(z_0) \ge J(x_0)$. Now suppose (27) holds for certain $k \ge 0$. Then

$q_{k+1}(z_{k+1}) = \psi_{k+1}(z_{k+1})$
$= (1 - a_{k+1}) q_k(z_{k+1}) + a_{k+1} [\ell_f(z_{k+1}; u_{k+1}, \lambda_1) + \Psi(z_{k+1})]$
$\overset{(a)}{\ge} (1 - a_{k+1}) [q_k(z_k) + c_k \Delta(z_{k+1}, z_k)]$



5 $u_{k+1}$ is clearly a convex combination of $z_k$ and $x_k$.






6 The last inequality of (28) does not require $q_0(x) \ge J(x)$. But $q_0(x) \ge J(x)$ can be easily proved by Lemma 15.

$\quad + a_{k+1} [\ell_f(z_{k+1}; u_{k+1}, \lambda_1) + \Psi(z_{k+1})]$
$\overset{(b)}{\ge} (1 - a_{k+1}) [f(x_k) + \Psi(x_k) + c_k \Delta(z_{k+1}, z_k)] + a_{k+1} \ell_f(z_{k+1}; u_{k+1}, \lambda_1) + a_{k+1} \Psi(z_{k+1})$
$\overset{(c)}{\ge} (1 - a_{k+1}) [f(u_{k+1}) + \langle \nabla f(u_{k+1}), x_k - u_{k+1} \rangle + \frac{\sigma c_k}{2} \|z_{k+1} - z_k\|^2]$
$\quad + a_{k+1} [f(u_{k+1}) + \langle \nabla f(u_{k+1}), z_{k+1} - u_{k+1} \rangle + \frac{\sigma \lambda_1}{2} \|z_{k+1} - u_{k+1}\|^2]$
$\quad + (1 - a_{k+1}) \Psi(x_k) + a_{k+1} \Psi(z_{k+1})$
$\overset{(d)}{\ge} \Psi(x_{k+1}) + f(u_{k+1}) + \langle \nabla f(u_{k+1}), (1 - a_{k+1}) x_k + a_{k+1} z_{k+1} - u_{k+1} \rangle$
$\quad + \frac{\sigma}{2} c_k (1 - a_{k+1}) \|z_{k+1} - z_k\|^2 + \frac{\sigma}{2} \lambda_1 a_{k+1} \|z_{k+1} - u_{k+1}\|^2 + \frac{\sigma}{2} \lambda_2 a_{k+1} (1 - a_{k+1}) \|z_{k+1} - x_k\|^2$
$\overset{(e)}{\ge} \Psi(x_{k+1}) + f(u_{k+1}) + \langle \nabla f(u_{k+1}), (1 - a_{k+1}) x_k + a_{k+1} z_{k+1} - u_{k+1} \rangle$
$\quad + \frac{\sigma}{2} (\tau_1 + \tau_2 + \tau_3) \left\| z_{k+1} - \frac{\tau_1 z_k + \tau_2 u_{k+1} + \tau_3 x_k}{\tau_1 + \tau_2 + \tau_3} \right\|^2$
$\overset{(f)}{=} \Psi(x_{k+1}) + f(u_{k+1}) + \langle \nabla f(u_{k+1}), x_{k+1} - u_{k+1} \rangle + \frac{L}{2} \|x_{k+1} - u_{k+1}\|^2$
$\overset{(g)}{\ge} \Psi(x_{k+1}) + f(x_{k+1}) = J(x_{k+1}),$

where (a) is because $z_k$ minimizes $q_k$ and $q_k$ is $c_k$-sc. (b) is by the induction assumption. (c) is by the convexity of $f$ and $\sigma$-sc of $d$. (d) is by the $\lambda_2$-sc of $\Psi$ and Property 1. (e) is by the convexity of norm. (f) is by the definition of $x_{k+1}$ and $u_{k+1}$, and the choice of $a_{k+1}$. (g) is by the $L$-l.c.g of $f$. ■
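Step (f) relies on the identity $u_{k+1} = (1 - a_{k+1}) x_k + a_{k+1} w$ with $w = (\tau_1 z_k + \tau_2 u_{k+1} + \tau_3 x_k)/(\tau_1 + \tau_2 + \tau_3)$, which gives $x_{k+1} - u_{k+1} = a_{k+1}(z_{k+1} - w)$. The sketch below checks this identity numerically for scalars, using the formula for $u_{k+1}$ as read from step 9 of Algorithm 3; all numeric values are hypothetical.

```python
def u_next(x_k, z_k, a, c_k, lam1, lam2):
    """Step 9 of Algorithm 3 (as read here) for scalar iterates."""
    tau1 = (1.0 - a) * c_k
    tau2 = lam1 * a
    tau3 = lam2 * a * (1.0 - a)
    tau = tau1 + tau2 + tau3
    u = ((tau - (tau1 + tau2) * a) * x_k + tau1 * a * z_k) / (tau - tau2 * a)
    return u, (tau1, tau2, tau3)


# Hypothetical scalar values, for illustration only.
x_k, z_k, a, c_k, lam1, lam2 = 0.3, -1.2, 0.4, 5.0, 0.5, 0.2
u, (t1, t2, t3) = u_next(x_k, z_k, a, c_k, lam1, lam2)

# Weighted center w used inside the norm in step (e).
w = (t1 * z_k + t2 * u + t3 * x_k) / (t1 + t2 + t3)

# Identity behind step (f): u_{k+1} = (1 - a) x_k + a w.
residual = u - ((1.0 - a) * x_k + a * w)
```

With this identity, $\frac{L}{2}\|x_{k+1} - u_{k+1}\|^2 = \frac{L a_{k+1}^2}{2}\|z_{k+1} - w\|^2$, and the choice of $a_{k+1}$ makes $L a_{k+1}^2 = \sigma(\tau_1 + \tau_2 + \tau_3)$.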

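Noting that $c_0 \ge \lambda$ by definition, we can bound $\prod_{i=1}^{k} (1 - a_i)$ by invoking Lemma 2.2.4 of [19] with the strong convexity constant being $\lambda$ and the Lipschitz constant of the gradient being

$L' := \frac{L}{\sigma} + \lambda_2.$

It is easy to verify that the condition number $L'/\lambda$ is monotonically decreasing in $\lambda_2$.

Lemma 13. For all $k \ge 1$, we have

$\prod_{i=1}^{k} (1 - a_i) \le \min\left\{ \left(1 - \sqrt{\lambda / L'}\right)^k, \frac{4}{(2 + k)^2} \right\}.$

Finally we bound $q_0(x) - J(x)$ by

$q_0(x) - J(x) = \frac{L}{\sigma} \Delta(x, u_0) + \langle \nabla f(u_0), x - u_0 \rangle + f(u_0) - f(x)$
$\le \left(\frac{L}{\sigma} - \lambda_1\right) \Delta(x, u_0).$    (by $\lambda_1$-sc of $f$)    (32)

By (28) and the definition $c_0 = L'$, we get

Theorem 14. For all $k \ge 1$ and $x \in \operatorname{dom}\Psi$,

$J(x_k) - J(x) \le (q_0(x) - J(x)) \min\left\{ \left(1 - \sqrt{\lambda / L'}\right)^k, \frac{4}{(2 + k)^2} \right\}$
$\le \left(\frac{L}{\sigma} - \lambda_1\right) \Delta(x, u_0) \min\left\{ \left(1 - \sqrt{\lambda / L'}\right)^k, \frac{4}{(2 + k)^2} \right\}.$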



This rate is completely independent of $\Psi$ (except $\lambda_2$). Although not needed by the proof, we can further show that $q_k(x) \ge J(x)$ for all $k \ge 0$ and $x \in \operatorname{dom}\Psi$.

Lemma 15. $q_k(x) \ge J(x)$ for all $k \ge 0$ and $x \in \operatorname{dom}\Psi$.

Proof. When $k = 0$,

$q_0(x) = \frac{L}{\sigma} \Delta(x, u_0) + f(u_0) + \langle \nabla f(u_0), x - u_0 \rangle + \Psi(x)$
$\ge \frac{L}{2} \|x - u_0\|^2 + f(u_0) + \langle \nabla f(u_0), x - u_0 \rangle + \Psi(x)$
$\ge f(x) + \Psi(x) = J(x).$

Suppose $k \ge 1$. By (25), $q_k(x) \ge q_k(z_k)$. By Lemma 12, $q_k(z_k) \ge J(x_k)$. So

$q_k(x) \ge q_k(z_k) \ge J(x_k) \ge J(x).$ ■
4.1. Adaptive L 

It is straightforward to incorporate backtracking of $L$ into the algorithm. We present this variant in Algorithm 4. Suppose at each iteration the inner loop terminates with $L_k$ and define $L'_k = L_k/\sigma + \lambda_2$. Noting $c_0 = L'_0$ and slightly changing the proof, Lemma 13 can be extended as follows:

Lemma 16. For all k > 1, we have 

/ ( i / 



n(l-aO<min 11 I-W77 



— ( 2 V 1 



11 



J \V^0 j=l 

Obviously, when $L'_k \equiv L'$ we recover Lemma 13.

Algorithm 4 AGM-EF-1 with adaptive L.
Require: Down scaling factor $\gamma_d$ and up scaling factor $\gamma_u$ ($\gamma_d, \gamma_u > 1$). An optimistic estimate $\tilde{L} \le L$.
1: Arbitrarily pick $u_0 \in \operatorname{dom}\Psi$. $L_0 \leftarrow \tilde{L}/\gamma_u$.
2: repeat
3: $L_0 \leftarrow L_0 * \gamma_u$.
4: Initialize $c_0 \leftarrow \frac{L_0}{\sigma} + \lambda_2$.
5: $q_0(x) \leftarrow \frac{L_0}{\sigma} \Delta(x, u_0) + \Psi(x) + \ell_f(x; u_0, 0)$.
6: $x_0 = z_0 \leftarrow \operatorname{argmin}_x q_0(x)$.
7: until $J(x_0) \le \min_x q_0(x)$
8: for $k = 0, 1, \ldots$ do
9: $L_{k+1} \leftarrow L_k/(\gamma_d * \gamma_u)$.
10: repeat
11: $L_{k+1} \leftarrow L_{k+1} * \gamma_u$.
12: Assign to $a_{k+1}$ the positive root (in $a$) of $\sigma(1 - a)(c_k + \lambda_2 a) + \sigma \lambda_1 a = L_{k+1} a^2$.
13: Do step 7 to 12 of Algorithm 3.
14: until $J(x_{k+1}) \le q_{k+1}(z_{k+1})$
15: end for
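The inner loop of Algorithm 4 (steps 9 to 14) first shrinks the previous estimate by $\gamma_d \gamma_u$ and then grows it by $\gamma_u$ until the acceptance test passes, so the estimate may decrease between iterations. The sketch below isolates that scaling logic; `accepts` is a hypothetical callback standing in for the test $J(x_{k+1}) \le q_{k+1}(z_{k+1})$, and the threshold used in the example is an assumption made only to have something to run.

```python
def backtrack_L(L_prev, gamma_d, gamma_u, accepts, max_tries=100):
    """Return the first estimate in the sequence L_prev/gamma_d * gamma_u^j (j >= 0) that is accepted."""
    L = L_prev / (gamma_d * gamma_u)  # step 9: optimistic shrink
    for _ in range(max_tries):
        L *= gamma_u  # step 11: scale up before each trial
        if accepts(L):  # stand-in for the test in step 14
            return L
    raise RuntimeError("no acceptable estimate of L found within max_tries")


# Hypothetical setting: any estimate at or above an unknown constant is accepted.
true_L = 8.0
estimate = backtrack_L(L_prev=1.0, gamma_d=2.0, gamma_u=1.5,
                       accepts=lambda L: L >= true_L)

# Starting from an over-estimate, the first trial L_prev/gamma_d is already accepted.
shrunk = backtrack_L(L_prev=100.0, gamma_d=2.0, gamma_u=1.5,
                     accepts=lambda L: L >= true_L)
```

Under this stand-in test, the accepted estimate overshoots the threshold by a factor of at most $\gamma_u$ when the search starts below it.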



Furthermore, (32) needs to be changed into

$q_0(x) - J(x) \le \left(\frac{L_0}{\sigma} - \lambda_1\right) \Delta(x, u_0).$



So we conclude for all fc > 1 and x <E dom^, 
J(x fe )- J(x) < (Z -A)A(x,u ) 



n ' 



1 



J i=l 



This bound does not involve the true L, and does not 
depend on ^ or the function value of / (which could 
be used to hide L). 

4.2. Bounding the duality gap 

It is also not hard to extend AGM-EF-1 to the same 
primal-dual settings as in Section 3.3. 

Using (30) and (31), we derive for all $x \in \operatorname{dom}\Psi$:

$q_{k+1}(x) \le (1 - a_{k+1}) q_k(x) + a_{k+1} [\ell_f(x; u_{k+1}, \lambda_1) + \Psi(x)].$    (33)

This inequality allows us to express $q_k$ in terms of the linearizations of $f$ at $u_i$. For notational convenience, define $a_0 = 1$ and

$b_k(i) := a_i \prod_{j=i+1}^{k} (1 - a_j)$ for all $0 \le i \le k$,

then it is easy to see that $\sum_{i=0}^{k} b_k(i) = 1$ for all $k \ge 1$.



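Lemma 17. For all $x \in \operatorname{dom}\Psi$ and $k \ge 1$,

$q_k(x) \le b_k(0) q_0(x) + \sum_{i=1}^{k} b_k(i) [\ell_f(x; u_i, \lambda_1) + \Psi(x)]$
$= \frac{L}{\sigma} b_k(0) \Delta(x, u_0) + \Psi(x) + \sum_{i=0}^{k} b_k(i) \ell_f(x; u_i, \lambda_1).$    (34)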



Proof. The inequality is obvious by inductively applying (33). The equality is by the definition of $q_0(x)$ and the fact that $\sum_{i=0}^{k} b_k(i) = 1$. ■

Go back to the settings of Section 3.3. We minimize $J(x)$ by AGM-EF-1 and find some dual iterates $\alpha_k$ such that the duality gap $J(x_k) - D(\alpha_k)$ goes to 0 fast. Similar to (22), we construct

$\alpha_k = \sum_{i=0}^{k} b_k(i)\, \alpha(u_i).$    (35)

Comparing with (22), we can see that both formulae are convex combinations of all the past $\alpha(u_i)$ and higher weights are given to the later $\alpha(u_i)$. Computationally, $\alpha_k$ can be efficiently updated by recursion

$\alpha_0 = \alpha(u_0)$, and $\alpha_{k+1} = (1 - a_{k+1}) \alpha_k + a_{k+1} \alpha(u_{k+1})$.
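As a quick numerical check of the weights $b_k(i) = a_i \prod_{j=i+1}^{k}(1 - a_j)$ with $a_0 = 1$, the sketch below verifies that they sum to one and that the recursion reproduces the direct sum in (35). The step sizes and the dual points are made-up values for illustration.

```python
def weights(a):
    """b_k(i) = a_i * prod_{j=i+1}^{k} (1 - a_j) for i = 0..k, where a[0] = 1."""
    k = len(a) - 1
    b = []
    for i in range(k + 1):
        tail = 1.0
        for j in range(i + 1, k + 1):
            tail *= 1.0 - a[j]
        b.append(a[i] * tail)
    return b


def recursive_dual(a, alphas):
    """alpha_0 = alpha(u_0); alpha_{k+1} = (1 - a_{k+1}) alpha_k + a_{k+1} alpha(u_{k+1})."""
    avg = list(alphas[0])
    for ak, al in zip(a[1:], alphas[1:]):
        avg = [(1.0 - ak) * v + ak * w for v, w in zip(avg, al)]
    return avg


# Hypothetical step sizes (a_0 = 1 by convention) and dual points alpha(u_i).
steps = [1.0, 0.6, 0.45, 0.35]
points = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.8], [0.3, 0.7]]

b = weights(steps)
direct = [sum(bi * p[j] for bi, p in zip(b, points)) for j in range(2)]
recur = recursive_dual(steps, points)
```

Only the current $\alpha_k$ has to be stored, which matches the 1-memory spirit of AGM-EF-1.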

To be self-contained, we state and prove the counter- 
part of Theorem 10 here. 

Theorem 18 (Bounds on the duality gap). Suppose a sequence $\{x_k, u_k, z_k\}$ is produced when AGM-EF-1 is applied to minimize $J(x)$ by treating $f$ as $\lambda_1$-sc. Then the $\{\alpha_k\}$ defined by (35) satisfies $\alpha_k \in Q_2$ and

$J(x_k) - D(\alpha_k) \le \frac{L}{\sigma} b_k(0) \max_{x \in \operatorname{dom}\Psi} \Delta(x, u_0).$    (36)

Proof. Since $\alpha(u_i) \in Q_2$ and $\alpha_k$ is a convex combination of them, $\alpha_k \in Q_2$. Clearly, (24) still holds. Denote the right-hand side of (36) as $M$. Now by using relationship (34) and Lemma 12, we have

$J(x_k) \le \min_x q_k(x)$
$\le \min_x \{ \frac{L}{\sigma} b_k(0) \Delta(x, u_0) + \Psi(x) + \sum_{i=0}^{k} b_k(i) \ell_f(x; u_i, \lambda_1) \}$
$\le M + \min_x \{ \Psi(x) + \sum_{i=0}^{k} b_k(i) \phi(x, \alpha(u_i)) \}$
$\le M + \min_x \{ \Psi(x) + \phi(x, \sum_{i=0}^{k} b_k(i) \alpha(u_i)) \}$
$= M + D(\alpha_k).$

5. Application to Regularized Risk Minimization

Regularized risk minimization (RRM) is extensively used in machine learning. In this section, we describe and compare in theory many different ways of training these models by APM. The objective of RRM with linear models can be written as

$\min_{w \in Q_1} J(w) = \Omega(w) + g^*(Aw),$    (37)

where $Q_1$ is a closed convex set. Here, $\Omega(w)$ corresponds to the regularizer and is assumed to be $\lambda$-sc wrt some prox-function $d_1$ on $Q_1$. $d_1$ is in turn assumed to be $\sigma_1$-sc wrt a norm $\|\cdot\|$ on $Q_1$. $Aw$ stands for the output of a linear model, and $g^*$ (the Fenchel dual of function $g$) encodes the empirical risk measuring the discrepancy between the correct labels and the output of the linear model ($Aw$). Let the domain of $g$ be $Q_2$, which is also assumed to be closed and convex.

Using the definition of Fenchel dual, the primal objective (37) can be rewritten as a minimax problem:

$\min_{w \in Q_1} \max_{\alpha \in Q_2} \ell(w, \alpha) := \Omega(w) + \langle Aw, \alpha \rangle - g(\alpha),$    (38)

which further leads to the adjoint problem

$\max_{\alpha \in Q_2} \{ -g(\alpha) + \min_{w \in Q_1} \{ \langle Aw, \alpha \rangle + \Omega(w) \} \}$, i.e.
$\max_{\alpha \in Q_2} D(\alpha) := -g(\alpha) - \Omega^*(-A^\top \alpha).$    (39)

It is well known [e.g. 29, Theorem 3.3.5] that under some mild constraint qualifications, the primal form $J(w)$ and the adjoint form $D(\alpha)$ satisfy

$J(w) \ge D(\alpha)$ and $\inf_{w \in Q_1} J(w) = \sup_{\alpha \in Q_2} D(\alpha)$.



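This can be posed in our framework by setting $Q_1 := \mathbb{R}^p$, $A := -YX^\top$, $\Omega(w) = \frac{\lambda}{2} \|w\|^2$, $g^*(u) = \min_{b \in \mathbb{R}} \frac{1}{n} \sum_i [1 + u_i - y_i b]_+$. This $g^*$ corresponds to

$g(\alpha) = -\sum_i \alpha_i$ if $\alpha \in Q_2$, and $\infty$ otherwise,    (40)

where $Q_2$, the domain of $g$, is

$Q_2 = \{ \alpha \in [0, n^{-1}]^n : \sum_i y_i \alpha_i = 0 \}.$

Then the adjoint form turns out to be the well known SVM dual objective:

$D(\alpha) = \sum_i \alpha_i - \frac{1}{2\lambda} \alpha^\top Y X^\top X Y \alpha$, s.t. $\alpha \in Q_2$.    (41)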



Example 2: $L_1$ regularized SVM. The primal form of the $L_1$ regularized SVM ($L_1$-SVM, [30]) is:

$J(w) = \lambda \|w\|_1 + \min_{b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} [1 - y_i(\langle x_i, w \rangle + b)]_+.$

This can be posed in our framework by using exactly the same configurations as above, except that now $\Omega(w) = \lambda \|w\|_1$. One can show that $\Omega^*(v) = 0$ if $\|v\|_\infty \le \lambda$, and $\infty$ otherwise. The adjoint form is:

$D(\alpha) = \sum_i \alpha_i$ if $\|XY\alpha\|_\infty \le \lambda$ and $\alpha \in Q_2$, and $-\infty$ otherwise.    (42)



Example 3: multivariate scores. Joachims [31] proposed a max-margin model which directly optimizes the $F_1$ score. Assume there are $n_+$ positive examples and $n_-$ negative examples. $F_1$-score is defined by using the contingency table: $\Delta(y', y) := \frac{2a}{2a + b + c}$.



Let us see some examples in machine learning which have the form (37). Assume we have access to a training set of $n$ labeled examples $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$. Denote $Y := \operatorname{diag}(y_1, \ldots, y_n)$ and $X := (x_1, \ldots, x_n)$.

Example 1: binary SVMs with bias. The primal form of the binary linear SVM with bias is:

$J(w) = \frac{\lambda}{2} \|w\|^2 + \min_{b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} [1 - y_i(\langle x_i, w \rangle + b)]_+.$



7 In the sequel, $\|\cdot\|$ will stand for the $L_p$ norm. Since each space has a single prescribed norm and the space that a variable belongs to is clear from the context, we will not use $\|\cdot\|_i$ to represent the norm on $Q_i$.



Contingency table.

              y = 1    y = -1
  y' = 1        a        c
  y' = -1       b        d

$b = \sum_i \delta(y_i = 1, y'_i = -1)$ (false negative), $c = \sum_i \delta(y_i = -1, y'_i = 1)$ (false positive), $a = n_+ - b$, $d = n_- - c$ ($n = n_+ + n_-$). $\delta(x) = 1$ if $x$ is true, else 0.

The primal objective proposed by Joachims [31] is

$J(w) = \frac{\lambda}{2} \|w\|^2 + \max_{y' \in \{-1,+1\}^n} \left[ \Delta(y', y) + \frac{1}{n} \sum_{i=1}^{n} \langle w, x_i \rangle (y'_i - y_i) \right].$    (43)

This can be recovered by setting $Q_1 = \mathbb{R}^p$, $\Omega(w) = \frac{\lambda}{2} \|w\|^2$, and letting $A$ be a $2^n$-by-$p$ matrix where the $y'$-th row is $\sum_{i=1}^{n} x_i^\top (y'_i - y_i)$ for each $y' \in \{-1, +1\}^n$. Then $g^*(u) = \max_{y'} [\Delta(y', y) + \frac{1}{n} u_{y'}]$ which is induced by

$g(\alpha) = -n \sum_{y'} \Delta(y', y) \alpha_{y'}$ if $\alpha \in Q_2$, and $\infty$ otherwise.    (44)

Here $Q_2$, the domain of $g$, is

$Q_2 = \{ \alpha \in [0, n^{-1}]^{2^n} : \sum_{y'} \alpha_{y'} = \frac{1}{n} \}.$

So we get the adjoint form

$D(\alpha) = -\frac{1}{2\lambda} \alpha^\top A A^\top \alpha + n \sum_{y'} \Delta(y', y) \alpha_{y'}$, $\alpha \in Q_2$.



Example 4: Max-margin Markov Networks. The conditional random fields (CRFs) [32] and max-margin Markov networks (M$^3$Ns) [33] are also instances of RRM. First, they both minimize a regularized risk with a square norm regularizer. Second, they assume that there is a joint feature map $\phi$ which maps $(x, y)$ to a feature vector in $\mathbb{R}^p$. Third, they assume a label loss $\ell(y, y^i; x^i)$ which quantifies the loss of predicting label $y$ when the correct label of input $x^i$ is $y^i$. Finally, they assume that the space of labels $\mathcal{Y}$ is endowed with a graphical model structure and that $\phi(x, y)$ and $\ell(y, y^i; x^i)$ factorize according to the cliques of this graphical model. The main difference is in the loss function employed. CRFs minimize the $L_2$-regularized logistic loss:

$J(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \log \sum_{y \in \mathcal{Y}} \exp\left( \ell(y, y^i; x^i) - \langle w, \phi(x^i, y^i) - \phi(x^i, y) \rangle \right),$    (45)

while the M$^3$Ns minimize the $L_2$-regularized hinge loss

$J(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left\{ \ell(y, y^i; x^i) - \langle w, \phi(x^i, y^i) - \phi(x^i, y) \rangle \right\}.$    (46)

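Clearly, both cases employ $Q_1 = \mathbb{R}^p$ and $\Omega(w) = \frac{\lambda}{2} \|w\|^2$. With shorthand $\psi^i_y := \phi(x^i, y^i) - \phi(x^i, y)$ and $\ell^i_y := \ell(y, y^i; x^i)$, they both use an $(n|\mathcal{Y}|)$-by-$p$ matrix $A$ whose $(i, y)$-th row is $(-\psi^i_y)^\top$. For M$^3$Ns, $g^*(u) = \frac{1}{n} \sum_i \max_y \{ \ell^i_y + u^i_y \}$ and it can be verified that the corresponding $g$ is

$g(\alpha) = -\sum_i \sum_y \ell^i_y \alpha^i_y$ if $\alpha \in Q_2$, and $\infty$ otherwise,    (47)

where

$Q_2 = S^n := \{ \alpha \in [0, 1]^{n|\mathcal{Y}|} : \sum_y \alpha^i_y = \frac{1}{n}, \forall i \}.$

Clearly, $Q_2$ is convex and compact. Now the adjoint form can be written as

$D(\alpha) = -\frac{1}{2\lambda} \alpha^\top A A^\top \alpha + \sum_i \sum_y \ell^i_y \alpha^i_y$, $\alpha \in S^n$.    (48)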



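For CRFs, $g^*(u) = \frac{1}{n} \sum_i \log \sum_{y \in \mathcal{Y}} \exp(\ell^i_y + u^i_y)$, and the corresponding $g$ is

$g(\alpha) = \sum_{i=1}^{n} \sum_y \alpha^i_y (\log \alpha^i_y - \ell^i_y) + \log n$ if $\alpha \in Q_2$, and $\infty$ otherwise.    (49)

The domain of $g$ is also $Q_2 = S^n$. Then, by (39), the adjoint form is

$D(\alpha) = -\frac{1}{2\lambda} \alpha^\top A A^\top \alpha - \sum_{i=1}^{n} \sum_y \alpha^i_y (\log \alpha^i_y - \ell^i_y) - \log n$, $\alpha \in S^n$.    (50)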



Example 5: Entropy regularized LPBoost. In [13], the entropy regularized LPBoost needs to minimize

$J(w) = \lambda \Delta(w, w^0) + \max_{i \in [t]} \langle u_i, w \rangle,$    (51)
s.t. $w \in Q_1 := \{ w \in [0, \nu]^n : \sum_i w_i = 1 \}.$

Here $\nu$ is a constant in $[0, 1]$, $w^0 \in Q_1$ is the uniform distribution, and $\Delta$ is the Bregman divergence induced by the entropy (i.e. $\Delta$ is the relative entropy). $u_i \in \mathbb{R}^n$ is the so called edge vector. This objective corresponds to $\Omega(w) = \lambda \Delta(w, w^0)$, $A = (u_1, \ldots, u_t)^\top$, $g^*(s) = \max_i s_i$ which is induced by $g(\alpha) = 0$ if $\alpha \in Q_2 := S_t$, and $\infty$ otherwise. Since

$\Omega^*(s) = \min_{\beta \ge 0} \left\{ \lambda \log \sum_i w^0_i \exp\left( \frac{s_i - \beta_i}{\lambda} \right) + \nu \sum_i \beta_i \right\},$

the adjoint form can be written as

$D(\alpha) = -\min_{\beta \ge 0} \left\{ \lambda \log \sum_i w^0_i \exp\left( \frac{-A_i^\top \alpha - \beta_i}{\lambda} \right) + \nu \sum_i \beta_i \right\}$

subject to $\alpha \in Q_2 = S_t$. Here $A_i$ denotes the $i$-th column of $A$. Although this form of $D(\alpha)$ is obscure, the strong convexity of $\Omega$ implies that $D(\alpha)$ is l.c.g. The $\nu$ is introduced by [13] to cap the density, and this cap is removed if $\nu = \infty$. In that case, $\beta_i$ in the definition of $D(\alpha)$ will all be optimized to 0 and we recover the well known log-sum-exp formula of $D(\alpha)$.

Example 6: Elastic net. Using square loss as an example of the empirical risk, the primal objective of elastic net regularization is

$J(w) = \lambda \left( \gamma \|w\|_1 + \frac{1}{2} \|w\|_2^2 \right) + \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top w)^2.$    (52)

Here the $L_1$ regularizer is introduced to promote the sparsity of the solution. In this case, $\Omega(w) = \lambda \left( \gamma \|w\|_1 + \frac{1}{2} \|w\|_2^2 \right)$ and its dual is left as an exercise for the reader. An equivalent formulation of (52) is by moving the regularizer into the constraint:

$\min_w \tilde{J}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top w)^2$
s.t. $\gamma \|w\|_1 + \frac{1}{2} \|w\|_2^2 \le r.$

It can be shown that for any $\lambda > 0$ there exists an $r > 0$ such that $\operatorname{argmin} J = \operatorname{argmin} \tilde{J}$ and vice versa.

There are also many regularized risk minimization problems which optimize over the space of positive semi-definite matrices, e.g. [2, 14, 34].

Summary. From these examples, we can see the following properties of $\Omega$ and $g$ which will also be assumed for our general treatment of the objective (37) and (39). Firstly, the function $\Omega(w)$ which serves as a regularizer is strongly convex. In Example 1, 3, 4, 6, $\Omega(w)$ is $\lambda$-sc wrt the Euclidean norm. In Example 5, $\Omega(w)$ is $\lambda$-sc wrt the $L_1$ norm. As a result, $\Omega^*$ must be $\frac{1}{\lambda}$-l.c.g on $\mathbb{R}^p$. Secondly, the l.c.g constant of $\Omega^*(-A^\top \alpha)$ in $\alpha$ also depends on the matrix norm of $A$, which in turn depends on the choice of norm on $Q_1$ and $Q_2$. Thirdly, the $g^*$ is not necessarily differentiable (e.g., hinge loss), but $g$ is always l.c.g on $Q_2$. Finally, $Q_2$ is bounded and its diameter can be well controlled. This is important for translating dual solutions into the primal.

Our goal is to minimize $J(w)$ over $Q_1$, and we do not really care about solving the dual $D(\alpha)$ over $Q_2$. However, since $D(\alpha)$ has favorable smoothness properties, we also often work in the dual as a proxy. To solve $J(w)$ (and $D(\alpha)$), there are three main approaches.

Smoothing $g^*$ to a fixed level. To handle the nonsmoothness of $g^*$, we can smooth it by using the technique introduced by Nesterov [6]. Then the composite form, $\Omega(w)$ plus the smoothed variant of $g^*(Aw)$, fits the form of AGM-EF and can be solved in $w$ (primal), $\alpha$ (dual) or primal-dual. Given a prescribed accuracy $\epsilon$, $g^*$ only needs to be smoothed to a fixed extent.

Smoothing $g^*$ with decreasing smoothness. [20] introduced a primal-dual method where $g^*$ is smoothed with decreased smoothness (i.e. increased closeness to $g^*$). As a result, it tends to the optimal solution of $D(\alpha)$ and $J(w)$, instead of just attaining a prescribed accuracy $\epsilon$.

No smoothing. Given the smoothness of the dual problem $D(\alpha)$, AGM can be applied to maximize it and then convert $\alpha_k$ into $w_k$ by (22) and (35). No smoothing of $g^*$ is needed in this case.

The next three subsections will describe these schemes in detail, with focus on the rates of convergence and how each iteration can be performed efficiently. Moreover, we provide intuitions on which scheme is more suitable. For brevity, we will only use AGM-EF-∞ with fixed $L$ as an example, while similar results can be straightforwardly derived for AGM-EF-1 and adaptive $L$. In this version of the paper, we illustrate all these ideas on Example 1 (SVM with bias).

5.1. Smoothing g* to a fixed level 

A key technique introduced by Nesterov [6] was to tightly approximate the nonsmooth part $g^*(Aw)$ by a smooth surrogate. The idea of the approach originates from Theorem 21 in Appendix A, which connects the strong convexity of a function and the l.c.g of its Fenchel dual. $g^*$ is not l.c.g because $g$ is not strongly convex, therefore to make $g^*$ smooth a natural idea is to add to $g$ a strongly convex function $d_2$ on $Q_2$ and then dualize it:

$g^*_\mu(u) := (g + \mu d_2)^*(u) = \sup_{\alpha \in Q_2} \{ \langle \alpha, u \rangle - g(\alpha) - \mu d_2(\alpha) \}.$    (53)

Here $\mu > 0$ and $d_2$ is assumed to be $\sigma_2$-sc wrt a norm on $Q_2$.$^8$ By proper centering, $d_2$ can be assumed to satisfy

$\min_{\alpha \in Q_2} d_2(\alpha) = 0.$

Let us further define

$\alpha_0 := \operatorname{argmin}_{\alpha \in Q_2} d_2(\alpha)$, $D := \max_{\alpha \in Q_2} d_2(\alpha)$.

The main restriction of this approach is that $D$ must be well bounded. Using the definition in (53) we can easily characterize the uniform tightness of the approximation: for all $u$

$g^*(u) - \mu D \le g^*_\mu(u) \le g^*(u).$    (54)

8 We can also use the more general form of strong convexity as in Definition 1. Here we use the conventional definition for simplicity.

Furthermore, the l.c.g constant of $g^*_\mu(Aw)$ in $w$ wrt the norm on $Q_1$ can be estimated as follows. By Theorem 21, $g^*_\mu$ is $(\mu \sigma_2)^{-1}$-l.c.g wrt the dual norm on $Q_2$. So we can apply the chain rule:

$\| \nabla_w g^*_\mu(Aw_1) - \nabla_w g^*_\mu(Aw_2) \|^* = \| A^\top (\nabla g^*_\mu(Aw_1) - \nabla g^*_\mu(Aw_2)) \|^*$
$\le \frac{1}{\mu \sigma_2} \|A\| \, \|Aw_1 - Aw_2\|^* \le \frac{\|A\|^2}{\mu \sigma_2} \|w_1 - w_2\|.$

That is, $g^*_\mu(Aw)$ is l.c.g in $w$ with constant

$L_g(\mu) \le \frac{\|A\|^2}{\mu \sigma_2}.$    (55)



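Example 1: smoothing the hinge loss. The hinge loss $[1 - w]_+$ is the dual of $g(\alpha) = \alpha$ for $\alpha \in [-1, 0]$ and $\infty$ elsewhere. Adding $\frac{\mu}{2} \alpha^2$ to $g$ and dualizing it, we get

$g^*_\mu(w) = 0$ if $w \ge 1$,
$g^*_\mu(w) = \frac{(1 - w)^2}{2\mu}$ if $w \in [1 - \mu, 1]$,
$g^*_\mu(w) = 1 - w - \frac{\mu}{2}$ if $w < 1 - \mu$.

Some smoothed hinge loss $g^*_\mu(w)$ with various $\mu$ are plotted in Figure 1.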
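The piecewise form follows from maximizing $\alpha(w - 1) - \frac{\mu}{2}\alpha^2$ over $\alpha \in [-1, 0]$, and it can be checked against the tightness bound (54). In the sketch below the prox-function is $d_2(\alpha) = \frac{1}{2}\alpha^2$, so $D = \max_{\alpha \in [-1,0]} d_2(\alpha) = \frac{1}{2}$; the function names and the evaluation grid are illustrative choices.

```python
def hinge(w):
    """Plain hinge loss [1 - w]_+."""
    return max(0.0, 1.0 - w)


def smoothed_hinge(w, mu):
    """sup over alpha in [-1, 0] of alpha*(w - 1) - (mu/2)*alpha**2, in closed form."""
    if w >= 1.0:
        return 0.0  # maximizer alpha = 0
    if w >= 1.0 - mu:
        return (1.0 - w) ** 2 / (2.0 * mu)  # interior maximizer alpha = (w - 1)/mu
    return 1.0 - w - mu / 2.0  # maximizer clipped at alpha = -1


# Evaluate both losses on a small grid for one smoothing level.
mu = 0.4
grid = [-1.0 + 0.05 * i for i in range(61)]  # w from -1 to 2
gaps = [hinge(w) - smoothed_hinge(w, mu) for w in grid]
```

The gap between the two losses stays in $[0, \mu/2]$, which is the bound (54) with $D = \frac{1}{2}$.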

Example 2: smoothing max into soft max. In the entropy regularized LPBoost, $g^*(s) = \max_i s_i$ and $g(\alpha) = 0$ if $\alpha \in S_t$ and $\infty$ otherwise. Then adding prox-function $\sum_i \alpha_i \ln \alpha_i$ to $g$ and dualizing it, we get

$g^*_\mu(s) = \mu \ln \sum_i \exp\left( \frac{s_i}{\mu} \right).$

When $\mu \to 0$, this soft max recovers max.
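The limit can be observed numerically. The sketch below evaluates $\mu \ln \sum_i \exp(s_i/\mu)$ in the usual numerically stable way (shifting by the maximum); the score vector is a made-up example. With this uncentered prox-function the soft max lies between $\max_i s_i$ and $\max_i s_i + \mu \ln t$, where $t$ is the number of entries.

```python
import math


def soft_max(s, mu):
    """mu * log(sum_i exp(s_i / mu)), computed with a max shift to avoid overflow."""
    m = max(s)
    return m + mu * math.log(sum(math.exp((si - m) / mu) for si in s))


# Hypothetical scores, for illustration only.
scores = [0.3, -1.0, 0.25]
values = {mu: soft_max(scores, mu) for mu in (1.0, 0.1, 0.01, 0.001)}
```

Smaller $\mu$ gives a tighter but less smooth approximation, which is the trade-off that the choice of $\mu$ in the following subsections controls.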

With the smoothed $g^*_\mu$ in place, we now discuss how to find an $\epsilon$ accurate solution to $J(w)$ by three different schemes: primal ($w$), dual ($\alpha$), and primal-dual.

5.1.1. Solving in the primal w. 

We will use $g^*_\mu$ to define a new objective function

$J_\mu(w) := \Omega(w) + g^*_\mu(Aw)$    (56)
$= \Omega(w) + \max_{\alpha \in Q_2} \{ \langle Aw, \alpha \rangle - g(\alpha) - \mu d_2(\alpha) \}.$



Figure 1. Smoothing hinge loss with different $\mu$.

bounded everywhere by $\epsilon$, i.e. $\max_w \{ J(w) - J_\mu(w) \} \le \epsilon$. By (54), this is guaranteed if $\mu$ is small enough:

$\mu \le \frac{\epsilon}{D}.$    (57)

Plugging (57) into (55), we obtain that the l.c.g constant of $g^*_\mu(Aw)$ is at most $\frac{D \|A\|^2}{\epsilon \sigma_2}$. Let $w^* = \operatorname{argmin}_w J(w)$. Bearing in mind that $\Omega$ is $\lambda$-sc, AGM-EF-∞ is readily applicable to $J_\mu(w)$ and the following rate of convergence can be inferred from Theorem 6:



J^(wfc) - J M (w*) < A(w*,u )min • 



4D||A|| 



4D||A||- 




CT 1( r 2 e(fc + l) 2 ' 

-2fc+2 



\ \/4D||A||" 
Once $J_\mu(w_k) - J_\mu(w^*) \le \epsilon$, we must have

$J(w_k) - J(w^*) \le J_\mu(w_k) - J_\mu(w^*) + \epsilon \le 2\epsilon.$

Therefore, we obtain the following theorem.

Theorem 19. For any given $\epsilon > 0$, setting $\mu$ by the equality in (57) and applying AGM-EF-∞ to $J_\mu(w)$, we can guarantee that $w_k$ is a $2\epsilon$ accurate solution of $J(w)$ as long as



. 1 /4D||A|r A , 

fc > min{-\ ^A(w*,u ), 

e V crio-2 



(58) 



l + -ln 



4£>||A|| 
aia 2 e 2 



■A(w*,u ) /in 1 




Since $J_\mu(w) \le J(w)$ for all $w$, to make sure that an $\epsilon$ accurate solution to $J_\mu$ is a $2\epsilon$ accurate solution to $J$, a sufficient condition is that their deviation be upper



Note $\ln(1 + \epsilon) \approx \epsilon$ when $\epsilon$ is close to 0, so the denominator in the second term becomes $O(\sqrt{\epsilon})$ and overall the second term is approximately $O\left( \frac{1}{\sqrt{\epsilon}} \ln \frac{1}{\epsilon} \right)$. The first term does not depend on $\lambda$. Note also that this bound does not explicitly depend on the diameter of $Q_1$ which is infinity in many cases. A closer look shows that $\Delta(w^*, u_0)$ hides the dependence on $\lambda$. With a small regularization parameter $\lambda$, $\Delta(w^*, u_0)$ may be large and could approach infinity when $\lambda$ tends to 0.

Unfortunately the bound on the duality gap in (23) does use the diameter of $Q_1$, and it cannot be replaced by $\Delta(w^*, u_0)$ as in Theorem 19. Therefore, we do lose a termination criterion. Fortunately, this problem in duality gap can be avoided if we optimize in $\alpha$. Before describing it in detail, let us illustrate the above procedure on training the SVM with bias.

Here, choose $d_1$ and $d_2$ as the Euclidean norm square and the norms on $Q_1$ and $Q_2$ are both Euclidean norm. Then $\|A\|^2 = \lambda_{\max}(A^\top A) = \lambda_{\max}(A A^\top)$ where $\lambda_{\max}$ stands for the maximum eigenvalue. $\sigma_1 = \sigma_2 = 1$. The diameter of $Q_2$ is $D \le n \cdot \frac{1}{n^2} = \frac{1}{n}$. For a given $\epsilon$, set $\mu = n\epsilon$ by (57). Suppose all $x_i$ lie in the ball with Euclidean radius $R$. Then $\lambda_{\max}(A A^\top) \le nR^2$ and the second term in (58) is essentially



5.1.2. Solving in the dual a. 

Similar to $J_\mu$ in (56), we can also define a smoothed version of $D(\alpha)$:

$D_\mu(\alpha) := -\mu d_2(\alpha) - g(\alpha) + \min_w \{ \Omega(w) + \langle Aw, \alpha \rangle \}$
$= -\mu d_2(\alpha) - g(\alpha) - \Omega^*(-A^\top \alpha)$    (59)

which is to be maximized over $\alpha \in Q_2$. So we can pose $-D_\mu(\alpha)$ in the composite form,

$f(\alpha) = g(\alpha) + \Omega^*(-A^\top \alpha)$, and $\Psi(\alpha) = \mu d_2(\alpha)$,

to which AGM-EF-∞ and AGM-EF-1 can be applied. Since $\Omega^*$ is $\frac{1}{\lambda}$-l.c.g, $f(\alpha)$ must be l.c.g with constant

$L_f = L_g + \frac{\|A\|^2}{\lambda},$    (60)

where $L_g$ is the l.c.g constant of $g$. $\Psi$ is $\mu$-sc. Applying the primal-dual scheme in Section 3.3 with $-D_\mu$ and $-J_\mu$ playing the role of $J$ and $D$ therein respectively, we get




< O 



2R , 1 
— = ln - 



^u(wfc) - D M (a fc ) < max A (a, u ) • min 



a.eQ-2 



<Jl 



4L 



f 



fT 2 (fc + l) 2 ' 
-2fe+2 "| 



Solving in the primal is also advantageous in terms of the condition number. When $g^*$ is smoothed by small $\mu$ or when the regularization parameter $\lambda$ is small, the condition number $c := L_g(\mu)/\lambda$ becomes very large. According to Theorem 6, the number of iterations to find an $\epsilon$ accurate solution is the min of



O 



logi 



log 1 



and O 



Once $J_\mu(w_k) - D_\mu(\alpha_k) \le \epsilon$, it is ensured that

$J(w_k) - \min_w J(w) \le J_\mu(w_k) + \mu D - \max_{\alpha \in Q_2} D(\alpha)$
$\le J_\mu(w_k) + \epsilon - D(\alpha_k)$
$\le J_\mu(w_k) + \epsilon - D_\mu(\alpha_k) \le 2\epsilon.$

So we conclude the following theorem.

Theorem 20. For any given $\epsilon > 0$, setting $\mu$ by the equality in (57) and applying the primal-dual scheme in Section 3.3 to $-D_\mu$ and $-J_\mu$, we can guarantee that $w_k$ is a $2\epsilon$ accurate solution of $J(w)$ as long as



So the linear convergence rate depends on $c$ by $O(\sqrt{c})$, as opposed to $O(c)$ in most linearly converging algorithms, e.g. gradient descent. Second, the min in Theorem 6 implies that when $\lambda$ is very small and the objective is very poorly conditioned, the linear convergence will be automatically superseded by the $1/\sqrt{\epsilon}$ rate which has better "constant". Some class of algorithms require manual rewiring in such a case, e.g. [25] and [35].

Finally, it is noteworthy that this method does not require $g$ be l.c.g.

$k \ge$




4M(||A|l 2 + AL g ) 



1 

2 In 



Acr 2 
ln 



4M(||A|| 2 + AL S ) 
Ao- 2 e 



1 



Ae<T 2 



4£>(|| J 4||' ! + AL S ) 



(61) 



where $M := \max_{\alpha \in Q_2} \Delta(\alpha, u_0)$.

It is important to note that this scheme requires $g$ be l.c.g, while solving in the primal does not make such a requirement.

Let us apply the scheme to SVM with bias, and use the same choice of norm and prox-function as before. Now $L_g = 0$ and $M = 1/n$. Using the approximation $\ln(1 + x) \approx x$ when $|x| \ll 1$, (61) becomes



k> 



2R 



- 1,1 



R 



Ac 



In 



4R 2 



As a final note, the way we smooth the empirical risk 
is different from [36] which changes hinge loss into 
square hinge loss or higher order. Our method has 
a smoothing parameter which trades smoothness for 
the tightness of the approximation. In contrast, the 
square hinge loss is just a heuristic approximation and 
no bound is available in optimization for its solution. 

5.2. Smoothing g* with decreasing smoothness 

A typical primal-dual solver for the objectives in (37) and (38) is the excessive gap technique [EGT, 20]. One concrete application is [37] where EGT is used to solve the above Example 4 (M$^3$N and CRF). Unfortunately, EGT forces a fixed way to initialize $w_0$ and $\alpha_0$. This is very inconvenient for homotopy and other warm-start techniques which utilize the closeness of solutions under small perturbations of the problem parameter (e.g. $\lambda$).

5.3. No smoothing of g* 

Since we assume $g$ is l.c.g and $\Omega$ is $\lambda$-sc, the dual (39) is l.c.g and AGM-EF-∞ is applicable. Since our ultimate goal is to minimize $J(w)$ we adopt the primal-dual scheme in Section 3.3. The l.c.g constant of $D$ is exactly the $L_f$ in (60). Treating $-D$ and $-J$ as the $J$ and $D$ therein respectively, we get

$J(w_k) - D(\alpha_k) \le \frac{4M(\|A\|^2 + \lambda L_g)}{\lambda \sigma_2 (k + 1)^2}.$

When applied to SVM with bias where $M = 1/n$ and $L_g = 0$, we get that $J(w_k) - D(\alpha_k) \le \epsilon$ for all

$k \ge \frac{2R}{\sqrt{\lambda \epsilon}} - 1.$
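To get a feel for the magnitude of this iteration count, the sketch below evaluates $k \ge 2R/\sqrt{\lambda\epsilon} - 1$ and the corresponding gap bound, reading the bound for SVM with bias as $4R^2/(\lambda (k+1)^2)$ (that is, $M = 1/n$, $\|A\|^2 \le nR^2$, $L_g = 0$, $\sigma_2 = 1$). The function names and the numeric values of $R$, $\lambda$ and $\epsilon$ are illustrative assumptions.

```python
import math


def iterations_no_smoothing(R, lam, eps):
    """Smallest nonnegative integer k with k >= 2R/sqrt(lam*eps) - 1."""
    return max(0, math.ceil(2.0 * R / math.sqrt(lam * eps) - 1.0))


def gap_bound(k, R, lam):
    """Upper bound 4 R^2 / (lam (k+1)^2) on J(w_k) - D(alpha_k) under the stated reading."""
    return 4.0 * R * R / (lam * (k + 1) ** 2)


# Hypothetical problem constants.
R, lam, eps = 1.0, 1e-4, 1e-3
k_needed = iterations_no_smoothing(R, lam, eps)
bound_at_k = gap_bound(k_needed, R, lam)
bound_before = gap_bound(k_needed - 1, R, lam)
```

The count scales as $1/\sqrt{\lambda\epsilon}$, so a tenfold smaller $\lambda$ costs roughly $\sqrt{10}$ times more iterations under this bound.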



When comparing the rates, it is important to bear in mind that machine learning problems usually do not need a high accuracy solution and so $\epsilon = 10^{-2}$ or $10^{-3}$ might suffice. In many cases, $\lambda$ will be set to a very small value such as $10^{-6}$. Therefore $\frac{1}{\epsilon}$ can be much smaller than $\frac{1}{\lambda}$. Also, we are currently bounding $\|A\|^2$ by $nR^2$ which can be very loose in practice. The dependence of $\Delta(w^*, u_0)$ on $\lambda$ is not clear either. Finally in practice when solving in the dual, the box constraints in SVM can cause considerable waste of gradient computation. Therefore the rates above just provide limited guidance and the most appropriate optimization strategy has to be picked empirically.

5.4. Efficient computation of the gradient 

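So far, we have ignored the computational complexity
per iteration, which is dominated by two operations:
computing the gradient and minimizing the model $\psi_k$
in AGM-EF-$\infty$ (or $q_k$ in AGM-EF-1). We first show
in this subsection that the gradient in all the above
examples can be computed efficiently. Indeed, the gradients
needed are $\frac{\partial}{\partial w} g_\mu^\star(Aw)$ and
$\frac{\partial}{\partial \alpha} \Omega^\star(-A^\top \alpha)$, with
the former always being more challenging. So we focus
on calculating $\frac{\partial}{\partial w} g_\mu^\star(Aw)$.

By the chain rule, $\frac{\partial}{\partial w} g_\mu^\star(Aw) = A^\top \nabla g_\mu^\star(Aw)$. Using [47,
Theorem X.1.4.4], $\nabla g_\mu^\star(u)$ can be computed by

\[ \nabla g_\mu^\star(u) = \operatorname*{argmax}_{\alpha \in Q_2} \; \langle u, \alpha \rangle - g(\alpha) - \mu d_2(\alpha). \qquad (62) \]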

In the case of the multivariate score (43) and (44), the
dimension of the domain of $g$ is exponentially high
in the number of training examples, and therefore
it is intractable to first compute $\nabla g_\mu^\star(Aw)$ and
then pre-multiply it by $A^\top$ ($A$ has exponentially
many rows). Similar tractability issues appear in
learning with structured outputs as in M$^3$N. Below
we present a dynamic programming based algorithm,
which costs $O(n^2)$ time and space to calculate
$A^\top \nabla g_\mu^\star(Aw)$ for

\[ d_2(\alpha) = \sum_{y'} \alpha_{y'} \ln \alpha_{y'}. \]



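In this case, the optimization problem in (62) is

\[ \min_{\alpha \in Q_2} \; \mu \sum_{y'} \alpha_{y'} \ln \alpha_{y'} - \sum_{y'} \Delta(y', y)\, \alpha_{y'} - \sum_{y'} u_{y'} \alpha_{y'}. \]

Noting that the $y'$-th row of $A$ is $\psi_{y'} := \sum_i (y'_i - y_i) x_i^\top$,
we get $u_{y'} = \psi_{y'} w = \sum_i (y'_i - y_i) x_i^\top w$. Following the
standard procedures (e.g. [37, Lemma 8]), the optimal
solution can be written as

\[ \alpha^\star_{y'} \propto \frac{1}{Z} \exp\Big( \frac{1}{\mu} \sum_{i=1}^n y'_i x_i^\top w + \frac{1}{\mu} \Delta(y', y) \Big), \]

where $Z := \sum_{y'} \exp\big( \frac{1}{\mu} \sum_{i=1}^n y'_i x_i^\top w + \frac{1}{\mu} \Delta(y', y) \big)$.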
So $\alpha^\star_{y'}$ can be interpreted as a distribution over $y'$
(normalized to — rather than 1). Then

\[ A^\top \nabla g_\mu^\star(Aw) = \sum_{y'} \alpha^\star_{y'} \psi_{y'}^\top = \sum_{y'} \alpha^\star_{y'} \sum_i (y'_i - y_i) x_i = -2 \sum_i y_i x_i \sum_{y' \sim -y_i} \alpha^\star_{y'}, \]

where $y' \sim -y_i$ means summing up all $y'$ whose $i$-th
element $y'_i$ equals $-y_i$. So $\sum_{y' \sim -y_i} \alpha^\star_{y'}$ is exactly
the marginal probability $p(y'_i = -y_i)$ under the joint
distribution $\alpha^\star_{y'}$. Now we show how to compute the
marginal distributions efficiently.

Unlike the inference in graphical models, there is no
clique factorization in $y'$. Fortunately, $\{y'_i\}$ are coupled
only through the loss $\Delta(y', y)$, which in turn depends
only on two "sufficient statistics" of $y'$: the number of false
negatives $b$ and the number of false positives $c$. For simplicity, we
sometimes also write $\Delta(y', y)$ as $\Delta(b, c)$. Without loss of
generality, assume the positive training examples are
the first $n_+$ ones ($y_1 = \ldots = y_{n_+} = 1$), and the negative
examples are the last $n - n_+$ ones ($y_{n_+ + 1} = \ldots = y_n = -1$).
Denote $y'_+ := (y'_1, \ldots, y'_{n_+})^\top$ and
$y'_- := (y'_{n_+ + 1}, \ldots, y'_n)^\top$. $y'_+ \sim b$ represents that $y'_+$
commits $b$ false negatives, i.e. $\sum_{i=1}^{n_+} \delta(y'_i = -1) = b$.
$y'_- \sim c$ represents that $y'_-$ commits $c$ false positives,
i.e. $\sum_{i=n_+ + 1}^{n} \delta(y'_i = 1) = c$. For simplicity, denote

\[ c_i := \exp\Big( \frac{1}{\mu} x_i^\top w \Big). \]

Figure 2. Path weight interpretation of the normalizer $Z$
and the marginal distributions $p(y'_k)$. (Lattice with the example index $k = 0, \ldots, n_+$ on the horizontal axis and the false negative count $f_n = 0, 1, 2, \ldots$ on the vertical axis.)



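Let us first compute the normalizer $Z$ as follows.

\[ Z = \sum_{b=0}^{n_+} \sum_{c=0}^{n_-} \sum_{y'_+ \sim b} \sum_{y'_- \sim c} \exp\Big( \frac{1}{\mu} \sum_{i=1}^{n} y'_i x_i^\top w + \frac{1}{\mu} \Delta(y', y) \Big) \]
\[ = \sum_{b=0}^{n_+} \sum_{c=0}^{n_-} \exp\Big( \frac{1}{\mu} \Delta(b, c) \Big) \underbrace{\sum_{y'_+ \sim b} \exp\Big( \frac{1}{\mu} \sum_{i=1}^{n_+} y'_i x_i^\top w \Big)}_{=: V_+(b)} \cdot \underbrace{\sum_{y'_- \sim c} \exp\Big( \frac{1}{\mu} \sum_{i=n_+ + 1}^{n} y'_i x_i^\top w \Big)}_{=: V_-(c)}. \]

Therefore, once we have $V_+(b)$ for all $b \in [n_+]$ and
$V_-(c)$ for all $c \in [n_-]$, $Z$ can be computed in
$n_+ n_-$ steps. For simplicity we only show how to compute
$V_+(b)$; $V_-(c)$ can be computed in exactly the same
way.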

For each fixed $b$, $V_+(b)$ can be equivalently reformulated
by Figure 2. Each node $(k, f)$ represents that $y'$
has committed $f$ false negatives in the first $k$ examples:
$\sum_{i=1}^{k} \delta(y'_i = -1) = f$. Each node is connected
to two nodes on its right: $(k+1, f+1)$ and $(k+1, f)$.
The former corresponds to $y'_{k+1} = -1$, i.e. one more
false negative is committed. So we attach to the diagonal
edge a weight $\exp(-\frac{1}{\mu} x_{k+1}^\top w) = \frac{1}{c_{k+1}}$. The
latter means $y'_{k+1} = 1$ and the false negative count is not
incremented. So the horizontal edge is attached with
weight $c_{k+1}$. A path from $(k, f)$ to $(k', f')$ ($k < k'$ and
$f \le f'$) is a sequence of nodes moving from $(k, f)$ to
$(k', f')$ along the edges of the graph: $(k, f_0) = (k, f) \to
(k+1, f_1) \to \ldots \to (k+s, f_s) = (k', f')$ where $s = k' - k$
and $f_{i+1} - f_i = 0$ or $1$. The weight of a path is defined
as the product of the weights of all edges on that path.

Clearly $V_+(b)$ is equal to the total weight of all paths
from $(0,0)$ to $(n_+, b)$. To compute it, define $\alpha_k(v)$
as the total weight of all paths from $(0,0)$ to $(k, v)$.
Then it is not hard to see the following recursion for
all $k = 1, \ldots, n_+$ and $v = 0, 1, \ldots, k$:

\[ \alpha_k(v) = \frac{1}{c_k} \alpha_{k-1}(v - 1) + c_k \alpha_{k-1}(v), \qquad (63) \]

where $\alpha_k(-1) := 0$ and $\alpha_k(k+1) := 0$ for all $k$. Algorithm 5
computes $V_+(b) = \alpha_{n_+}(b)$ for all $b \in [n_+]$.
Clearly the computational cost is $O(n_+^2)$. If we only
need $V_+(b)$ then the space complexity is $O(n_+)$. But
later we will need all $\alpha_k(v)$, so we keep $O(n_+^2)$ memory.
Taking into account the similar cost for $V_-(c)$, the
total spatial and computational cost is both $O(n^2)$.

Algorithm 5 Forward propagation to compute all
$\{V_+(b) : 0 \le b \le n_+\}$.

1: Initialize $\alpha_0(0) = 1$.
2: for $k = 1, \ldots, n_+$ do
3:   for $v = 0, 1, \ldots, k$ do
4:     $\alpha_k(v) = \frac{1}{c_k} \alpha_{k-1}(v - 1) + c_k \alpha_{k-1}(v)$.
5:   end for
6: end for
7: Return: $V_+(b) = \alpha_{n_+}(b)$ for all $0 \le b \le n_+$.

Algorithm 6 Backward propagation to compute
$p(y'_k)$ for all $k \in [n_+]$.

1: Initialize $\xi_{n_+}(v) = \eta_-(v)$ for all $v = 0, 1, \ldots, n_+$.
2: $Z_{n_+} = c_{n_+} \sum_{v=0}^{n_+ - 1} \alpha_{n_+ - 1}(v)\, \xi_{n_+}(v)$.
3: for $k = n_+ - 1, \ldots, 1$ do
4:   for $v = 0, 1, \ldots, k$ do
5:     $\xi_k(v) = c_{k+1} \xi_{k+1}(v) + \frac{1}{c_{k+1}} \xi_{k+1}(v + 1)$.
6:   end for
7:   $Z_k = c_k \sum_{v=0}^{k-1} \alpha_{k-1}(v)\, \xi_k(v)$.
8: end for
9: Return: $p(y'_k = 1) = Z_k / Z$ for all $k \in [n_+]$.
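The forward recursion (63) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list `c` stands for the positive weights $c_1, \ldots, c_{n_+}$, and the function names `forward_path_weights` and `v_plus_brute_force` are hypothetical, not from the paper. The brute-force routine enumerates all labelings and is included only to check the dynamic program on a tiny instance.

```python
import itertools
import math


def forward_path_weights(c):
    """alpha[k][v] = total weight of lattice paths from (0, 0) to (k, v).

    c[k - 1] plays the role of c_k (assumed positive): a horizontal edge
    into column k has weight c_k, a diagonal edge has weight 1 / c_k.
    """
    n_pos = len(c)
    alpha = [[0.0] * (n_pos + 1) for _ in range(n_pos + 1)]
    alpha[0][0] = 1.0
    for k in range(1, n_pos + 1):
        c_k = c[k - 1]
        for v in range(k + 1):
            diagonal = alpha[k - 1][v - 1] / c_k if v >= 1 else 0.0
            horizontal = alpha[k - 1][v] * c_k
            alpha[k][v] = diagonal + horizontal
    return alpha


def v_plus_brute_force(c, b):
    """Sum the weights of all labelings with exactly b entries equal to -1."""
    total = 0.0
    for labels in itertools.product((1, -1), repeat=len(c)):
        if labels.count(-1) == b:
            total += math.prod(
                c_i if y == 1 else 1.0 / c_i for c_i, y in zip(c, labels)
            )
    return total


c = [1.3, 0.7, 2.0, 0.9]          # illustrative values only
alpha = forward_path_weights(c)
V_plus = alpha[len(c)]            # V_plus[b] corresponds to V_+(b)
```

The table `alpha` is kept in full because the backward pass reuses every row, which is where the quadratic memory cost comes from.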

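To compute the marginal distributions $p(y'_k)$ we need
a backward propagation. For example let us consider
$p(y'_k = 1)$ for $k \in [n_+]$; the case of $k > n_+$ (negative
examples) can be dealt with similarly. By the
definition of $\alpha_{y'}$, it suffices to compute

\[ Z_k := \sum_{y' : y'_k = 1} \exp\Big( \frac{1}{\mu} \sum_{i=1}^{n} y'_i x_i^\top w + \frac{1}{\mu} \Delta(y', y) \Big) \]
\[ = \sum_{b=0}^{n_+} \sum_{c=0}^{n_-} \exp\Big( \frac{1}{\mu} \Delta(b, c) \Big) \sum_{y'_+ \sim b,\, y'_k = 1} \exp\Big( \frac{1}{\mu} \sum_{i=1}^{n_+} y'_i x_i^\top w \Big) \cdot \sum_{y'_- \sim c} \exp\Big( \frac{1}{\mu} \sum_{i=n_+ + 1}^{n} y'_i x_i^\top w \Big) \]
\[ = \sum_{b=0}^{n_+} \underbrace{\sum_{y'_+ \sim b,\, y'_k = 1} \exp\Big( \frac{1}{\mu} \sum_{i=1}^{n_+} y'_i x_i^\top w \Big)}_{=: T_+(b)} \cdot \underbrace{\sum_{c=0}^{n_-} \exp\Big( \frac{1}{\mu} \Delta(b, c) \Big) V_-(c)}_{=: \eta_-(b)}. \]

Since $V_-(c)$ is available from forward propagation,
$\{\eta_-(b)\}$ can be computed in $O(n^2)$ time. So the only
problem left is to compute $T_+(b)$. $T_+(b)$ has a very
intuitive interpretation in Figure 2: the total weight
of all paths from $(0,0)$ to $(n_+, b)$ with the $k$-th step
(i.e. between the horizontal coordinates $k-1$ and $k$)
going horizontal (not diagonal). Let $\beta_k^b(v)$ denote the
total weight of all paths from $(k, v)$ to $(n_+, b)$. Then

\[ T_+(b) = c_k \sum_{v=0}^{k-1} \alpha_{k-1}(v)\, \beta_k^b(v). \]

So

\[ Z_k = \sum_{b=0}^{n_+} T_+(b)\, \eta_-(b) = \sum_{b=0}^{n_+} c_k \sum_{v=0}^{k-1} \alpha_{k-1}(v)\, \beta_k^b(v)\, \eta_-(b) = c_k \sum_{v=0}^{k-1} \alpha_{k-1}(v) \underbrace{\sum_{b=0}^{n_+} \beta_k^b(v)\, \eta_-(b)}_{=: \xi_k(v)}. \]

Therefore as long as $\xi_k(v)$ can be updated efficiently,
so can $Z_k$. Fortunately, $\beta_k^b(v)$ has a recursive form

\[ \beta_k^b(v) = c_{k+1} \beta_{k+1}^b(v) + \frac{1}{c_{k+1}} \beta_{k+1}^b(v + 1), \]

for all $0 \le k \le n_+ - 1$, $0 \le v \le k$ and $0 \le b \le n_+$.
This implies for all $0 \le k \le n_+ - 1$ and $0 \le v \le k$

\[ \xi_k(v) = \sum_{b=0}^{n_+} \beta_k^b(v)\, \eta_-(b) = \sum_{b=0}^{n_+} \Big( c_{k+1} \beta_{k+1}^b(v) + \frac{1}{c_{k+1}} \beta_{k+1}^b(v + 1) \Big) \eta_-(b) = c_{k+1} \xi_{k+1}(v) + \frac{1}{c_{k+1}} \xi_{k+1}(v + 1). \]

The final algorithm is summarized in Algorithm 6. Its
time and space cost is both $O(n^2)$. The initialization
of $\xi_{n_+}$ therein is based on initializing $\beta_{n_+}^b(v) = \delta(v = b)$
for all $b, v = 0, 1, \ldots, n_+$.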
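The backward pass can be checked numerically on a small instance. The sketch below is an illustration, not the authors' implementation: `c` again stands for $c_1, \ldots, c_{n_+}$, `eta` is an arbitrary positive list standing in for $\eta_-(0), \ldots, \eta_-(n_+)$, and all function names are hypothetical. It computes every $Z_k$ with the $\xi$ recursion and compares against direct enumeration.

```python
import itertools
import math


def forward_path_weights(c):
    """alpha[k][v] = total path weight from (0, 0) to (k, v), as in (63)."""
    n = len(c)
    alpha = [[0.0] * (n + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for k in range(1, n + 1):
        for v in range(k + 1):
            diagonal = alpha[k - 1][v - 1] / c[k - 1] if v >= 1 else 0.0
            alpha[k][v] = diagonal + alpha[k - 1][v] * c[k - 1]
    return alpha


def backward_partial_sums(c, eta):
    """Return Z with Z[k] for k = 1..n (Z[0] is unused).

    eta[b] stands in for eta_-(b), b = 0..n.
    """
    n = len(c)
    alpha = forward_path_weights(c)
    xi = list(eta)                               # xi_n(v) = eta_-(v)
    Z = [0.0] * (n + 1)
    Z[n] = c[n - 1] * sum(alpha[n - 1][v] * xi[v] for v in range(n))
    for k in range(n - 1, 0, -1):
        c_next = c[k]                            # this is c_{k+1}
        xi = [c_next * xi[v] + xi[v + 1] / c_next for v in range(k + 1)]
        Z[k] = c[k - 1] * sum(alpha[k - 1][v] * xi[v] for v in range(k))
    return Z


def z_k_brute_force(c, eta, k):
    """Enumerate all labelings whose k-th entry is +1."""
    total = 0.0
    for labels in itertools.product((1, -1), repeat=len(c)):
        if labels[k - 1] == 1:
            weight = math.prod(
                ci if y == 1 else 1.0 / ci for ci, y in zip(c, labels)
            )
            total += weight * eta[labels.count(-1)]
    return total


c = [1.3, 0.7, 2.0, 0.9]                 # illustrative values only
eta = [0.5, 1.0, 2.0, 0.25, 3.0]         # illustrative values only
Z = backward_partial_sums(c, eta)
Z_total = sum(a * e for a, e in zip(forward_path_weights(c)[len(c)], eta))
marginals = [Z[k] / Z_total for k in range(1, len(c) + 1)]
```

Each entry of `marginals` is the fraction of the total weight carried by labelings with the $k$-th label equal to $+1$.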

The gradient of $g_\mu^\star(Aw)$ for M$^3$Ns can also be computed
efficiently by dynamic programming, but the
key structure it exploits is the clique decomposition
in graphical models. Details can be found in [37].

5.5. Minimizing the model efficiently 

In this section, we show that the model $\psi_k$ can be
minimized efficiently.

5.5.1. Diagonal quadratic constrained to a
box and a hyperplane

When AGM-EF is applied to solve the dual optimization
problem $D(\alpha)$ for SVM in (41), each iteration
needs to minimize the model subject to $Q_2$. This can be
reduced to a box constrained diagonal QP with a single
linear equality constraint:

\[ \min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{n} d_i^2 (\alpha_i - m_i)^2 \qquad (64) \]
\[ \text{s.t.} \quad l_i \le \alpha_i \le u_i \quad \forall i \in [n]; \qquad \sum_{i=1}^{n} \sigma_i \alpha_i = z. \]

Similarly, when solving in the primal with smoothing
in (56), the gradient query also involves an optimization
of this form. In this section, we focus on the
QP in (64). The algorithm we describe
below stems from [38] and finds the exact optimal solution
in $O(n)$ time, faster than the $O(n \log n)$ complexity
in [39]. [39] also proposes a median finding
based algorithm which has linear time complexity in
expectation. In contrast, our method is deterministic
and linear. Liu and Ye [40] tackle this problem too,
but they use the mean bisection and apply Newton's
method to find a solution up to a prespecified
accuracy $\delta$. The resulting total cost is $O(n \log \frac{1}{\delta})$.

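Without loss of generality, we assume $l_i < u_i$ and $d_i \ne 0$
for all $i$. Also assume $\sigma_i \ne 0$ because otherwise
$\alpha_i$ can be solved independently. To make the feasible
region nonempty, we also assume

\[ z \ge \sum_i \sigma_i \big( \delta(\sigma_i > 0)\, l_i + \delta(\sigma_i < 0)\, u_i \big) \]
\[ \text{and} \quad z \le \sum_i \sigma_i \big( \delta(\sigma_i > 0)\, u_i + \delta(\sigma_i < 0)\, l_i \big). \]

With a simple change of variable $\beta_i = \sigma_i(\alpha_i - m_i)$, the
problem (64) is simplified as

\[ \min_{\beta} \; \frac{1}{2} \sum_{i=1}^{n} \bar d_i^2 \beta_i^2 \]
\[ \text{s.t.} \quad l'_i \le \beta_i \le u'_i \quad \forall i \in [n]; \qquad \sum_{i=1}^{n} \beta_i = z', \]

where $\bar d_i^2 = d_i^2 / \sigma_i^2$,

\[ l'_i = \begin{cases} \sigma_i (l_i - m_i) & \text{if } \sigma_i > 0 \\ \sigma_i (u_i - m_i) & \text{if } \sigma_i < 0 \end{cases}, \qquad u'_i = \begin{cases} \sigma_i (u_i - m_i) & \text{if } \sigma_i > 0 \\ \sigma_i (l_i - m_i) & \text{if } \sigma_i < 0 \end{cases}, \qquad z' = z - \sum_i \sigma_i m_i. \]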




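Figure 3. $h_i(\lambda)$

Write out its partial Lagrangian:

\[ \min_{\beta_i \in [l'_i, u'_i]} \; \max_{\lambda \in \mathbb{R}} \; \frac{1}{2} \sum_{i=1}^{n} \bar d_i^2 \beta_i^2 - \lambda \Big( \sum_{i=1}^{n} \beta_i - z' \Big). \]

Due to strong duality, we can swap the min and max:

\[ \max_{\lambda \in \mathbb{R}} \; \min_{\beta_i \in [l'_i, u'_i]} \; \frac{1}{2} \sum_{i=1}^{n} \bar d_i^2 \beta_i^2 - \lambda \Big( \sum_{i=1}^{n} \beta_i - z' \Big) \]
\[ = \max_{\lambda \in \mathbb{R}} \; \sum_i \min_{\beta_i \in [l'_i, u'_i]} \Big( \frac{1}{2} \bar d_i^2 \beta_i^2 - \lambda \beta_i \Big) + \lambda z' \]
\[ \Leftrightarrow \; \min_{\lambda \in \mathbb{R}} \; \sum_i \underbrace{\max_{\beta_i \in [l'_i, u'_i]} \Big( -\frac{1}{2} \bar d_i^2 \beta_i^2 + \lambda \beta_i \Big)}_{=: H_i(\lambda)} - \lambda z'. \qquad (65) \]

Clearly, the optimal $\beta_i^*(\lambda)$ in the definition of $H_i(\lambda)$
can be solved analytically, and this gives

\[ H_i(\lambda) = \begin{cases} -\frac{1}{2} \bar d_i^2 u_i'^2 + \lambda u'_i & \text{if } \lambda > u'_i \bar d_i^2 \\ -\frac{1}{2} \bar d_i^2 l_i'^2 + \lambda l'_i & \text{if } \lambda < l'_i \bar d_i^2 \\ \frac{\lambda^2}{2 \bar d_i^2} & \text{if } \lambda \in [l'_i \bar d_i^2, u'_i \bar d_i^2] \end{cases} \quad \text{with} \quad \beta_i^*(\lambda) = \begin{cases} u'_i & \text{if } \lambda > u'_i \bar d_i^2 \\ l'_i & \text{if } \lambda < l'_i \bar d_i^2 \\ \frac{\lambda}{\bar d_i^2} & \text{if } \lambda \in [l'_i \bar d_i^2, u'_i \bar d_i^2] \end{cases} \]

To minimize the objective in (65) as a function of $\lambda$, we
notice that $H_i(\lambda)$ is convex and differentiable. Thus,
the minimizer of (65) is exactly the root of its gradient.
Note the gradient of $H_i$:

\[ h_i(\lambda) = \begin{cases} u'_i & \text{if } \lambda > u'_i \bar d_i^2 \\ l'_i & \text{if } \lambda < l'_i \bar d_i^2 \\ \frac{\lambda}{\bar d_i^2} & \text{if } \lambda \in [l'_i \bar d_i^2, u'_i \bar d_i^2] \end{cases} \]

See Figure 3 for the plot of $h_i(\lambda)$. So we need to find
the root of the gradient of (65):

\[ f(\lambda) := \sum_{i=1}^{n} h_i(\lambda) - z' = 0. \qquad (66) \]

Regularized Risk Minimization by Nesterov's Accelerated Gradient Methods

Figure 4. All possible locations of $\min S$ and $\max S$ on $h_i(\lambda)$:
(a) $\min S < \bar d_i^2 l'_i < \bar d_i^2 u'_i < \max S$;
(b) $\min S < \bar d_i^2 l'_i < \max S < \bar d_i^2 u'_i$;
(c) $\bar d_i^2 l'_i < \min S < \bar d_i^2 u'_i < \max S$;
(d) $\min S < \max S < \bar d_i^2 l'_i$;
(e) $\bar d_i^2 l'_i < \min S < \max S < \bar d_i^2 u'_i$;
(f) $\bar d_i^2 u'_i < \min S < \max S$.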

Note that $h_i(\lambda)$ is a monotonically increasing function
of $\lambda$, so the whole $f(\lambda)$ is monotonically increasing in
$\lambda$. Since $f(\infty) \ge 0$ by $z' \le \sum_i u'_i$ and $f(-\infty) \le 0$ by
$z' \ge \sum_i l'_i$, the root must exist. Considering that $f$
has at most $2n$ kinks (nonsmooth points) and is linear
between two adjacent kinks, the simplest idea is to sort
$\{\bar d_i^2 l'_i, \bar d_i^2 u'_i : i \in [n]\}$ into $s^{(1)} \le \ldots \le s^{(2n)}$. If $f(s^{(i)})$
and $f(s^{(i+1)})$ have different signs, then the root must
lie between them and can be easily found because $f$
is linear in $[s^{(i)}, s^{(i+1)}]$. This algorithm takes at least
$O(n \log n)$ time because of sorting.
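The sorting-based idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the linear-time method developed next: `d2`, `lo`, `hi` and `z` stand for the squared coefficients, the lower and upper bounds, and the right-hand side of the simplified problem, and the function names are hypothetical. After sorting the kinks it binary-searches for the bracketing pair and interpolates linearly.

```python
def clip(x, lo, hi):
    """Median of {lo, hi, x} for lo <= hi."""
    return max(lo, min(x, hi))


def f_value(lam, d2, lo, hi, z):
    """f(lam) = sum_i h_i(lam) - z with h_i(lam) = clip(lam / d2_i, lo_i, hi_i)."""
    return sum(clip(lam / a, l, u) for a, l, u in zip(d2, lo, hi)) - z


def root_by_sorting(d2, lo, hi, z):
    """Find lam with f(lam) = 0 by sorting the kinks of f."""
    kinks = sorted(
        set([a * l for a, l in zip(d2, lo)] + [a * u for a, u in zip(d2, hi)])
    )
    if f_value(kinks[0], d2, lo, hi, z) > 0 or f_value(kinks[-1], d2, lo, hi, z) < 0:
        raise ValueError("infeasible: z must lie in [sum(lo), sum(hi)]")
    # Invariant: f(kinks[left]) <= 0 <= f(kinks[right]).
    left, right = 0, len(kinks) - 1
    while right - left > 1:
        mid = (left + right) // 2
        if f_value(kinks[mid], d2, lo, hi, z) <= 0:
            left = mid
        else:
            right = mid
    a, b = kinks[left], kinks[right]
    fa, fb = f_value(a, d2, lo, hi, z), f_value(b, d2, lo, hi, z)
    if fa == fb:              # both are zero, so every point of [a, b] is a root
        return a
    return a - fa * (b - a) / (fb - fa)   # f is linear between adjacent kinks


# Illustrative data only.
d2 = [1.0, 4.0, 0.25]
lo = [-1.0, 0.0, -2.0]
hi = [2.0, 3.0, 1.0]
z = 1.5
lam = root_by_sorting(d2, lo, hi, z)
beta = [clip(lam / a, l, u) for a, l, u in zip(d2, lo, hi)]
```

The minimizer of the simplified QP is then recovered coordinate-wise from the root, as in the last line.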

However, this cost can be reduced to $O(n)$ by making
use of the fact that the median of $n$ (unsorted)
elements can be found in $O(n)$ time. Notice that due
to the monotonicity of $f$, $f$ evaluated at the median
of a set $S$ is exactly the median of the function values,
i.e., $f(\mathrm{MED}(S)) = \mathrm{MED}(\{f(x) : x \in S\})$. Algorithm
7 shows the binary search. Let $|S|$ denote the cardinality
of $S$. The while loop must terminate in the order of
$\log_2(2n)$ iterations because in each iteration the cardinality
of $S$ is reduced to at most $\frac{|S|}{2} + 1$ (we will
say that it "almost halves"). So if $f(m)$ can be evaluated in
$O(|S|)$ time, then the time complexity of each iteration
is linear in $|S|$ and the total complexity of Algorithm
7 is $O(n)$. Steps 7 and 9 ensure that $|S| = 2$ at step 12.

The evaluation of $f(m)$ potentially involves summing
up $n$ terms as in (66). However, by carefully aggregating
the slope and offset, this can be reduced to $O(|S|)$
too. In more detail, let us first consider all the possible
locations of $\min S$ and $\max S$ on $h_i(\lambda)$ as illustrated
in Figure 4. By halving the set $S$, the possible transitions
between situations are shown in Figure 5. Once the set
$S$ gets into the states (d), (e), (f), its state will never
change with the shrinking of $S$, and the contribution
of $h_i(\lambda)$ to $f(\lambda)$ will be determined by: $l'_i$ for case
(d), $u'_i$ for case (f) and $\frac{\lambda}{\bar d_i^2}$ for case (e). So we keep
two buffers: $c_s \in \mathbb{R}$ which aggregates the contribution
by all the $h_i$ ending in state (d) or (f), and $s_g \in \mathbb{R}$
which aggregates the slope $\frac{1}{\bar d_i^2}$ for all $h_i$ ending in state
(e). In other words, to evaluate $f(m)$ we only need to
visit those $h_i$ which are still in state (a), (b) and (c)
(called undetermined states). But how many such $i$
can there be? By Figure 4, these $h_i$ all contribute at
least one kink point in $S$ (state (a) contributes two).
If $\{\bar d_i^2 l'_i, \bar d_i^2 u'_i : i \in [n]\}$ are distinct, then the points in
$S$ have one-to-one correspondence to the kink points of
$h_i$. Therefore, the number of $h_i$ in undetermined states
must be upper bounded by the size of $S$. Since the size
of $S$ almost halves in each iteration, so does the number of $h_i$
in undetermined states. As a result, the cost for computing
$f(m)$ halves too. Overall, running Algorithm 7
to completion, the total time spent on evaluating $f(m)$
in step 4 is $O(n)$.

Algorithm 7 $O(n)$ algorithm to find the root of
$f(\lambda)$. Do not allow duplicate points in $S$.

1: Initialize kink set $S \leftarrow \{\bar d_i^2 l'_i, \bar d_i^2 u'_i : i \in [n]\}$. Remove duplicates if any.
2: while $|S| > 2$ do
3:   Find the median of $S$: $m \leftarrow \mathrm{MED}(S)$.
4:   if $f(m) = 0$ then
5:     Return $m$.
6:   else if $f(m) > 0$ then
7:     $S \leftarrow \{x \in S : x \le m\}$.
8:   else
9:     $S \leftarrow \{x \in S : x \ge m\}$.
10:  end if
11: end while
12: Return $\frac{l f(u) - u f(l)}{f(u) - f(l)}$ if $S = \{l, u : f(l) \ne f(u)\}$, or
    any value in $[l, u]$ if $S = \{l < u : f(l) = f(u)\}$.

Figure 5. All possible transitions of state.

The analysis becomes a bit more complicated when
$\{\bar d_i^2 l'_i, \bar d_i^2 u'_i : i \in [n]\}$ contains duplicate points. In this
case, one point in $S$ may correspond to kink points of
multiple $h_i$, and so the above argument can no longer
be used to upper bound the number of $h_i$ in undetermined
states. The simplest patch is to add small
perturbations to the duplicate points and make them
different. A more principled solution is given in Algorithm 8.
The key idea is to allow duplicates in $S$, and
replace $S \leftarrow \{x \in S : x \le m\}$ in step 7 of Algorithm
7 by $S \leftarrow \{x \in S : x < m\}$ (and similarly step 9). An
additional level of if-then-else check is introduced so
as not to miss the solution. Clearly, the size of $S$
still halves in Algorithm 8. More importantly, because
we allow duplicates in $S$, the size of $S$ is an
upper bound on the number of $h_i$ which are in undetermined
states. Therefore, the cost for computing $f(m)$
and $f(y)$ halves through iterations, and the total time
spent on evaluating $f(m)$ and $f(y)$ is $O(n)$.

Note that the duplicate removal in Algorithm 7
actually cannot be done in $O(n)$ time, and is subject
to numerical precision. In our experiments, we used
Algorithm 8, which does not remove duplicates. Its
correctness is easy to prove, and in practice there are
almost no duplicates and it works very well.



Algorithm 8 $O(n)$ algorithm to find the root of
$f(\lambda)$. Allow duplicate kink points in $S$.

1: Initialize kink set $S \leftarrow \{\bar d_i^2 l'_i, \bar d_i^2 u'_i : i \in [n]\}$. Keep duplicates, so $|S| = 2n$.
2: while $|S| > 2$ do
3:   Find the median of $S$: $m \leftarrow \mathrm{MED}(S)$.
4:   if $f(m) = 0$ then
5:     Return $m$.
6:   else if $f(m) > 0$ then
7:     Find $y := \max\{x \in S : x < m\}$. // $\{x \in S : x < m\}$ must be nonempty.
8:     if $f(y) > 0$ then
9:       $S \leftarrow \{x \in S : x < m\}$.
10:    else
11:      $S \leftarrow \{y, m\}$. // Root lies in $[y, m]$, so exit the while loop immediately.
12:    end if
13:  else
14:    Find $y := \min\{x \in S : x > m\}$. // $\{x \in S : x > m\}$ must be nonempty.
15:    if $f(y) < 0$ then
16:      $S \leftarrow \{x \in S : x > m\}$.
17:    else
18:      $S \leftarrow \{m, y\}$. // Root lies in $[m, y]$, so exit the while loop immediately.
19:    end if
20:  end if
21: end while
22: Return $\frac{l f(u) - u f(l)}{f(u) - f(l)}$ if $S = \{l, u : f(l) \ne f(u)\}$, or
    any value in $[l, u]$ if $S = \{l < u : f(l) = f(u)\}$.
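The bracketing logic of the median search can be illustrated with a deliberately simplified sketch. It follows the duplicate-free variant (Algorithm 7) but makes two simplifying assumptions that forfeit the linear running time: the median is obtained by sorting rather than by a linear-time selection routine, and $f$ is re-evaluated over all $n$ terms instead of using the aggregated buffers. The names `d2`, `lo`, `hi`, `z` and `root_by_median_search` are hypothetical.

```python
def clip(x, lo, hi):
    return max(lo, min(x, hi))


def root_by_median_search(d2, lo, hi, z):
    """Shrink the kink set S around the root of f by repeated median tests."""

    def f(lam):
        return sum(clip(lam / a, l, u) for a, l, u in zip(d2, lo, hi)) - z

    # Duplicate-free kink set, as in Algorithm 7.
    S = list(
        set([a * l for a, l in zip(d2, lo)] + [a * u for a, u in zip(d2, hi)])
    )
    if f(min(S)) > 0 or f(max(S)) < 0:
        raise ValueError("infeasible: z must lie in [sum(lo), sum(hi)]")
    while len(S) > 2:
        m = sorted(S)[len(S) // 2]     # simplification: median via sorting
        fm = f(m)
        if fm == 0:
            return m
        if fm > 0:
            S = [x for x in S if x <= m]   # root lies to the left of m
        else:
            S = [x for x in S if x >= m]   # root lies to the right of m
    l, u = min(S), max(S)
    fl, fu = f(l), f(u)
    if fl == fu:                        # both zero: any point of [l, u] works
        return l
    return (l * fu - u * fl) / (fu - fl)


# Illustrative data only.
lam = root_by_median_search(
    [1.0, 4.0, 0.25], [-1.0, 0.0, -2.0], [2.0, 3.0, 1.0], 1.5
)
```

Because the surviving elements of `S` always form a contiguous run of the sorted kinks, the final two points are adjacent kinks and the linear interpolation in the last line is exact.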



5.5.2. Elastic net 

For the first type of elastic net (52), the composite 
optimization is easy thanks to the separability. The 
second type which uses constraints is much more chal- 
lenging, and we show in this section how to solve this 
constrained optimization in linear time. Our approach 
is similar to the previous Section 5.5.1. 

At each iteration of AGM-EF-oo or AGM-EF-1, we 
need to solve 




Since all dimensions of $w$ are decoupled, each $w_i$ can
be solved separately as a one dimensional optimization
problem. In fact, its solution enjoys a simple closed
form [41, p. 384]:

\[ w_i = \begin{cases} \frac{L g_i - \gamma \lambda}{\lambda + L} & \text{if } \lambda < L g_i / \gamma \\ 0 & \text{if } \lambda \ge L g_i / \gamma \end{cases} \]

A more difficult version of elastic net is based on constraints,
where in each iteration one needs to solve

\[ \min_{w} \; \frac{1}{2} \|w - g\|^2 \qquad (67) \]
\[ \text{s.t.} \quad \gamma \|w\|_1 + \frac{1}{2} \|w\|^2 \le r. \qquad (68) \]



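Clearly the optimal $w_i$ has the same sign as $g_i$, hence
we can assume $g_i \ge 0$ without loss of generality. Next
we follow the same idea as in Section 5.5.1 and reformulate
(67) into a one dimensional root finding problem.
First write out the Lagrangian:

\[ \min_{w} \max_{\lambda \ge 0} \; \frac{1}{2} \|w - g\|^2 + \lambda \Big( \gamma \|w\|_1 + \frac{1}{2} \|w\|^2 - r \Big) \]
\[ \Leftrightarrow \; \max_{\lambda \ge 0} \min_{w} \; \underbrace{\frac{1}{2} \|w - g\|^2 + \lambda \Big( \gamma \|w\|_1 + \frac{1}{2} \|w\|^2 - r \Big)}_{=: f_\lambda(w)}, \]

where the equivalence is based on a simple check of
Slater's condition. For each fixed $\lambda$, the optimal $w$
can be found by setting the subgradient to 0:

\[ w_i - g_i + \lambda (\gamma s_i + w_i) = 0 \quad \text{for some } s_i \in \partial |w_i|. \]

Therefore, the optimal solution is

\[ w_i = \begin{cases} \frac{g_i - \lambda \gamma}{1 + \lambda} & \text{if } \lambda < g_i / \gamma \\ 0 & \text{if } \lambda \ge g_i / \gamma \end{cases} \qquad (69) \]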



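Plugging it back into $f_\lambda(w)$ we get the one dimensional
optimization problem in $\lambda$:

\[ H(\lambda) = -r\lambda + \sum_{i=1}^{p} \begin{cases} \frac{1}{2} g_i^2 - \frac{(g_i - \lambda\gamma)^2}{2(1 + \lambda)} & \text{if } \lambda < g_i / \gamma \\ \frac{1}{2} g_i^2 & \text{if } \lambda \ge g_i / \gamma \end{cases} \]

It is easy to see that $H(\lambda)$ is concave in $[-1, \infty)$. So
its maximizer is 0 or the root of its derivative

\[ H'(\lambda) = -r + \sum_{i=1}^{p} \begin{cases} \frac{(\gamma + g_i)^2 - \gamma^2 (\lambda + 1)^2}{2(\lambda + 1)^2} & \text{if } \lambda < g_i / \gamma \\ 0 & \text{if } \lambda \ge g_i / \gamma \end{cases} \; = \; \frac{1}{2(\lambda + 1)^2} h(\lambda), \]

where

\[ h(\lambda) = -2r(\lambda + 1)^2 + \sum_{i=1}^{p} \begin{cases} -\gamma^2 (\lambda + 1)^2 + (\gamma + g_i)^2 & \text{if } \lambda < g_i / \gamma \\ 0 & \text{if } \lambda \ge g_i / \gamma \end{cases} \]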



Clearly $h(\lambda)$ is monotonically decreasing in $[0, \infty)$. So
$H(\lambda)$ is maximized at 0 if $h(0) \le 0$, i.e.

\[ \sum_{i=1}^{p} \Big( \gamma g_i + \frac{1}{2} g_i^2 \Big) \le r. \]

Otherwise, $h(\lambda)$ has a root in $[0, \infty)$. Since it monotonically
decreases, the binary search trick in Section 5.5.1
can also be applied here. Once it is determined that
the optimal $\lambda$ is less than a set of $g_i / \gamma$, these quadratics
can be aggregated by summing up the $\gamma g_i + \frac{1}{2} g_i^2$.
Finally, $w$ is recovered by (69).
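The constrained step (67)-(68) can be sketched numerically as follows. This is an illustration only: it locates the root of $h(\lambda)$ with plain bisection rather than the linear-time median search, and the name `project_elastic_net` and the sample data are hypothetical. Signs are handled by working with $|g_i|$ and restoring the sign of $g_i$ at the end.

```python
import math


def project_elastic_net(g, gamma, r):
    """Minimize 0.5*||w - g||^2 s.t. gamma*||w||_1 + 0.5*||w||^2 <= r.

    Assumes gamma > 0 and r > 0. Returns (w, lam), lam being the multiplier.
    """
    a = [abs(x) for x in g]
    if sum(gamma * x + 0.5 * x * x for x in a) <= r:
        return list(g), 0.0              # h(0) <= 0: g itself is feasible

    def h(lam):
        t = (lam + 1.0) ** 2
        active = sum(
            (gamma + x) ** 2 - gamma ** 2 * t for x in a if lam < x / gamma
        )
        return -2.0 * r * t + active

    low, high = 0.0, max(a) / gamma      # h(low) > 0 and h(high) < 0
    for _ in range(200):                 # plain bisection on a decreasing h
        mid = 0.5 * (low + high)
        if h(mid) > 0:
            low = mid
        else:
            high = mid
    lam = 0.5 * (low + high)
    # Recover w from (69), restoring the sign of each g_i.
    w = [
        math.copysign(max(x - lam * gamma, 0.0) / (1.0 + lam), g_i)
        for x, g_i in zip(a, g)
    ]
    return w, lam


# Illustrative data only.
w, lam = project_elastic_net([3.0, -1.0, 0.2], 0.5, 1.0)
constraint_value = 0.5 * sum(abs(x) for x in w) + 0.5 * sum(x * x for x in w)
```

When the input is infeasible, the constraint is active at the solution, so `constraint_value` should match `r` up to rounding.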

5.6. Optimizing the Prox-function 

When smoothing g* , we have often used prox-function 



However, it is possible to improve
the condition number by using an optimized prox-function.
This idea was used by [17], where the l.c.g
constant of a quadratic $\frac{1}{2} x^\top H x + \langle b, x \rangle$ ($x \in \mathbb{R}^p$)
is upper bounded by $p$ when the norm is chosen as
$\|x\|^2 = \sum_i H_{ii} x_i^2$, i.e. rescaling all dimensions.

Using this idea, we show in this section that a data
dependent optimization of the prox-function can improve
the condition number of the smoothed variant
of the primal objective as discussed in Section 5.1.

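Let us consider the following simple but illustrative
example. Suppose $Q_2 = [0, c]^n$ and $g(u) = -\sum_{i=1}^{n} u_i$.
Denote $A = (a_1, \ldots, a_n)^\top$. We adopt a prox-function

\[ d(u) = \frac{1}{2} \sum_{i=1}^{n} b_i u_i^2, \]

and we can derive $g_\mu^\star = (g + \mu d)^\star$. The diameter of $Q_2$
under $d$ is

\[ D = \max_{u \in Q_2} d(u) = \frac{c^2}{2} \sum_i b_i. \qquad (70) \]

For any prescribed accuracy $\epsilon > 0$, we first choose $\mu$
such that $\mu D \le \epsilon$, i.e. $\mu \le \epsilon / D$. Then our goal is to find
the $b_i$ which minimize the Lipschitz constant of the
gradient of $g_\mu^\star(Aw)$ wrt $w$.

First compute $g_\mu^\star(Aw)$:

\[ g_\mu^\star(Aw) = \sup_{u \in Q_2} \; \langle Aw, u \rangle + \sum_i u_i - \frac{\mu}{2} \sum_i b_i u_i^2. \]

It is easy to see that the optimal $u^*$ is

\[ u_i^* = \mathrm{MED}\Big\{ 0, \; c, \; \frac{\langle a_i, w \rangle + 1}{\mu b_i} \Big\}, \]

where MED stands for the median. So the gradient of
$g_\mu^\star(Aw)$ wrt $w$ can be calculated by

\[ \frac{\partial}{\partial w} g_\mu^\star(Aw) = \sum_i g_i, \quad \text{where} \quad g_i = \begin{cases} 0 & \text{if } u_i^* = 0 \\ c\, a_i & \text{if } u_i^* = c \\ \frac{\langle a_i, w \rangle + 1}{\mu b_i} a_i & \text{else} \end{cases} \]

So the Hessian of $g_\mu^\star(Aw)$ in $w$ can only take values in

\[ H_\delta = \frac{1}{\mu} \sum_i \frac{\delta_i}{b_i} a_i a_i^\top, \quad \text{where } \delta_i \in \{0, 1\}. \]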
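The closed form for the smoothed conjugate and its gradient in this example can be sketched as below. It is an illustration under stated assumptions: `A` is a list of rows $a_i$, `b` holds the positive weights $b_i$, and the function name `smoothed_conjugate` and the sample numbers are hypothetical. The gradient is accumulated as the sum of $u_i^* a_i$, which agrees with the three cases of $g_i$.

```python
def smoothed_conjugate(A, w, b, mu, c):
    """Value and gradient (wrt w) of the smoothed conjugate in the example.

    Assumes g(u) = -sum(u), Q_2 = [0, c]^n and d(u) = 0.5 * sum(b_i * u_i^2).
    """
    value = 0.0
    grad = [0.0] * len(w)
    for a_i, b_i in zip(A, b):
        s = sum(x * y for x, y in zip(a_i, w)) + 1.0      # <a_i, w> + 1
        u = min(max(s / (mu * b_i), 0.0), c)              # median of {0, c, s/(mu b_i)}
        value += s * u - 0.5 * mu * b_i * u * u
        for j, a_ij in enumerate(a_i):
            grad[j] += u * a_ij                           # contribution u_i^* a_i
    return value, grad


# Illustrative data only: the rows end up interior, interior, clipped at c,
# and clipped at 0, respectively.
A = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7], [-5.0, 1.0]]
b = [1.0, 2.0, 0.5, 1.0]
w = [0.2, -0.1]
mu, c = 2.0, 1.0
value, grad = smoothed_conjugate(A, w, b, mu, c)
```

A finite-difference comparison of `value` against `grad` is a convenient sanity check, since the smoothed function is piecewise quadratic in $w$.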



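Now for any $w_1, w_2 \in Q_1$, denote $l = \|w_1 - w_2\|$
and $v = (w_2 - w_1)/l$ (so $\|v\| = 1$). Denote $h(t) =
\frac{\partial}{\partial w} g_\mu^\star(A(w_1 + tv))$. So

\[ \frac{\big\| \frac{\partial}{\partial w} g_\mu^\star(Aw_1) - \frac{\partial}{\partial w} g_\mu^\star(Aw_2) \big\|}{\|w_1 - w_2\|} = \frac{\|h(l) - h(0)\|}{l} \overset{(a)}{\le} \|\nabla h(\xi)\| \overset{(b)}{=} \|H_\delta v\| \overset{(c)}{\le} \lambda_{\max}(H_\delta). \]

Here, (a) is by the mean value theorem with $\xi \in [0, l]$,
(b) is by the chain rule, where the $\delta$ for $H_\delta$ is determined
by $\xi$, and (c) is because for any real positive semi-definite
matrix $H$, $\max_{\|v\| = 1} \|Hv\| = \lambda_{\max}(H)$.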

Clearly $\lambda_{\max}(H_\delta)$ is maximized when all $\delta_i = 1$;
let us call the resulting matrix $H_1$. In conjunction with (70) and (57),
we minimize $\lambda_{\max}(H_1)$ wrt $b_i$:



minA max (Fi) = min— A max 

hi b, e 



D 





— max mm 

2e l|vl|=i h 



2J6™v) a 



(71) 



— max , / 

2e ||v||=l \ 



'a^vl (Cauchy-Schwartz) (72) 



max | a i~ v | 



Note [42] used the heuristic that b\ = Ha^l^. We can 
also compare with the isotropic d, i.e. bi = 1. Simply 
plug bi — 1 into (71), and we get 



2c 



which must be greater than or equal to 



2c 



El a H 



in (72) for all $v$. Therefore with a fixed $\epsilon$, our approach
can potentially reduce the l.c.g constant of $g_\mu^\star(Aw)$ in
$w$. The maximum eigenvector can be found very efficiently
by power iteration, and usually 5 to
6 iterations are enough.
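Power iteration for the largest eigenvalue of a matrix of the form $\sum_i a_i a_i^\top / b_i$ can be sketched without ever forming the matrix. This is a minimal illustration; the function name `lambda_max_power_iteration`, the default iteration count and the sample data are assumptions rather than values from the paper.

```python
import math


def lambda_max_power_iteration(A, b, iters=50):
    """Estimate the top eigenpair of M = sum_i a_i a_i^T / b_i (b_i > 0).

    A is a list of rows a_i. M is applied implicitly, so each iteration
    costs one pass over the rows.
    """
    p = len(A[0])
    v = [1.0 / math.sqrt(p)] * p                 # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        Mv = [0.0] * p
        for a_i, b_i in zip(A, b):
            coef = sum(x * y for x, y in zip(a_i, v)) / b_i
            for j, a_ij in enumerate(a_i):
                Mv[j] += coef * a_ij
        lam = math.sqrt(sum(x * x for x in Mv))  # ||M v|| for unit v
        if lam == 0.0:
            return 0.0, v
        v = [x / lam for x in Mv]
    return lam, v


# Illustrative data only: M = diag(1, 4), so the largest eigenvalue is 4.
lam_top, v_top = lambda_max_power_iteration([[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0])
```

The estimate converges when the starting vector is not orthogonal to the leading eigenvector; the number of iterations needed depends on the gap between the two largest eigenvalues.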

6. Experimental Results 

We will present the experimental results in a later
version.

7. Discussion 

Much effort (e.g., [43, 44]) has been devoted to
making Nesterov's method online, i.e. to use a stochastic
gradient oracle while preserving the rate of convergence
for the expected gap. This, however, turns out to be
hopeless, as was shown by the lower bounds in [45, 46].

A. Concepts from Convex Analysis

The following four concepts from convex analysis are
used in the paper.

Definition 2. Suppose a convex function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$
is finite at $w$. Then a vector $g \in \mathbb{R}^n$ is called a
subgradient of $f$ at $w$ if, and only if,

\[ f(w') \ge f(w) + \langle w' - w, g \rangle \quad \text{for all } w'. \]

The set of all such $g$ vectors is called the subdifferential
of $f$ at $w$, denoted by $\partial_w f(w)$. For any convex
function $f$, $\partial_w f(w)$ must be nonempty. Furthermore
if it is a singleton then $f$ is said to be differentiable at
$w$, and we use $\nabla f(w)$ to denote the gradient.

Definition 3. A convex function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is
strongly convex with respect to a norm $\|\cdot\|$ if there exists
a constant $\sigma > 0$ such that $f - \frac{\sigma}{2} \|\cdot\|^2$ is convex.
$\sigma$ is called the modulus of strong convexity of $f$, and
for brevity we will call $f$ $\sigma$-strongly convex.

Definition 4. Suppose a function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is differentiable
on $Q \subseteq \mathbb{R}^n$. Then $f$ is said to have Lipschitz
continuous gradient (l.c.g) with respect to a norm
$\|\cdot\|$ if there exists a constant $L$ such that

\[ \|\nabla f(w) - \nabla f(w')\|_* \le L \|w - w'\| \quad \forall\, w, w' \in Q. \]

For brevity, we will call $f$ $L$-l.c.g.

Definition 5. The Fenchel dual of a function $f :
\mathbb{R}^n \to \overline{\mathbb{R}}$ is a function $f^\star : \mathbb{R}^n \to \overline{\mathbb{R}}$ defined by

\[ f^\star(w^\star) = \sup_{w \in \mathbb{R}^n} \{ \langle w, w^\star \rangle - f(w) \}. \]

Strong convexity and l.c.g are related by Fenchel duality
according to the following theorem:

Theorem 21 ([47, Theorem 4.2.1 and 4.2.2]).

1. If $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is $\sigma$-strongly convex, then $f^\star$ is
finite on $\mathbb{R}^n$ and $f^\star$ is $\frac{1}{\sigma}$-l.c.g.

2. If $f : \mathbb{R}^n \to \mathbb{R}$ is convex, differentiable on $\mathbb{R}^n$, and
$L$-l.c.g, then $f^\star$ is $\frac{1}{L}$-strongly convex.

Finally, the following lemma gives a useful characterization
of the minimizer of a convex function.

Lemma 22 ([47, Theorem 2.2.1]). A convex function
$f$ is minimized at $w^*$ if, and only if, $0 \in \partial f(w^*)$. Furthermore,
if $f$ is strongly convex, then its minimizer
is unique.